Description

II. RELATED APPLICATIONS

[0001] This application is a continuation-in-part of U.S. Application Serial No. 60/141,521 filed June 25, 1999, which
is incorporated by reference herein.

III. FIELD OF THE INVENTION 

[0002] The invention relates to the field of genomics and genetics, including genome analysis and the study of DNA
variation. In particular, the invention relates to the fields of pharmacogenetics and pharmacogenomics and the use
of genetic haplotype information to predict an individual's susceptibility to disease and/or their response to a particular
drug or drugs, so that drugs tailored to genetic differences of population groups may be developed and/or administered
to the appropriate population.

[0003] The invention also relates to tools to analyze DNA, catalog variations in DNA, study gene function and link
variations in DNA to an individual's susceptibility to a particular disease and/or response to a particular drug or drugs.
[0004] The invention may also be used to link variations in DNA to personal identity and racial or ethnic background.
[0005] The invention also relates to the use of haplotype information in the veterinary and agricultural fields.

IV. BACKGROUND OF THE INVENTION

[0006] The accumulation of genomic information and technology is opening doors for the discovery of new diagnostics,
preventive strategies, and drug therapies for a whole host of diseases, including diabetes, hypertension, heart
disease, cancer, and mental illness. This is because many human diseases have genetic components, which
may be evidenced by clustering in certain families, and/or in certain racial, ethnic or ethnogeographic (world population)
groups. For example, prostate cancer clusters in some families. Furthermore, while prostate cancer is common among
all U.S. males, it is especially common among African American men. They are 35 percent more likely than Americans
of European descent to develop the disease and more than twice as likely to die from it. A variation on chromosome
1 (HPC1) and a variation on the X chromosome (HPCX) appear to predispose men to prostate cancer, and a study is
currently underway to test this hypothesis.

[0007] Likewise, it is clear that an individual's genes can have considerable influence over how that individual
responds to a particular drug or drugs.

[0008] Individuals inherit specific versions of enzymes that affect how they metabolize, absorb and excrete drugs.
So far, researchers have identified several dozen enzymes that vary in their activity throughout the population and that
probably dictate people's response to drugs - which may be good, bad or sometimes deadly. For example, the cytochrome
P450 family of enzymes (of which CYP2D6 is a member) is involved in the metabolism of at least 20 percent
of all commonly prescribed drugs, including the antidepressant Prozac™, the painkiller codeine, and high-blood-pressure
medications such as captopril. Ethnic variation is also seen in this instance. Due to genetic differences in cytochrome
P450, for example, 6 to 10 percent of Whites, 5 percent of Blacks, and less than 1 percent of Asians are poor
drug metabolizers.

[0009] One very troubling observation is that adverse reactions often occur in patients receiving a standard dose of
a particular drug. As an example, doctors in the 1950s would administer a drug called succinylcholine to induce muscle
relaxation in patients before surgery. A number of patients, however, never woke up from anesthesia - the compound
paralyzed their breathing muscles and they suffocated. It was later discovered that the patients who died had inherited
a mutant form of the enzyme that clears succinylcholine from their system. As another example, as early as the 1940s
doctors noticed that certain tuberculosis patients treated with the antibacterial drug isoniazid would feel pain, tingling
and weakness in their limbs. These patients were unusually slow to clear the drug from their bodies - isoniazid must
be rapidly converted to a nontoxic form by an enzyme called N-acetyltransferase. This difference in drug response
was later discovered to be due to differences in the gene encoding the enzyme. The number of people who would
experience adverse responses using this drug is not small. Forty to sixty per cent of Caucasians have the less active
form of the enzyme (i.e., "slow acetylators").

[0010] Another gene encodes a liver enzyme that causes side effects in some patients who used Seldane™, an
allergy drug which was removed from the market. Seldane™ is dangerous to people who have liver disease, are taking
antibiotics, or are using the antifungal drug Nizoral. The major problem with Seldane™ is that it can cause serious,
potentially fatal, heart rhythm disturbances when more than the recommended dose is taken. The real danger is that
it can interact with certain other drugs to cause this problem at usual doses. It was discovered that people with a
particular version of a CYP450 suffered serious side effects when they took Seldane™ with the antibiotic erythromycin.
[0011] Sometimes one ethnic group is affected more than others. During the Second World War, for example, African-American
soldiers given the antimalarial drug primaquine developed a severe form of anaemia. The soldiers who
became ill had a deficiency in an enzyme called glucose-6-phosphate dehydrogenase (G6PD) due to a genetic variation
that occurs in about 10 per cent of Africans, but very rarely in Caucasians. G6PD deficiency probably became more
common in Africans because it confers some protection against malaria.

[0012] Variations in certain genes can also determine whether a drug treats a disease effectively. For example, a
cholesterol-lowering drug called pravastatin won't help people with high blood cholesterol if they have a common gene
variant for an enzyme called cholesteryl ester transfer protein (CETP). As another example, several studies suggest
that the version of the "ApoE" gene that is associated with a high risk of developing Alzheimer's disease in old age
(i.e., APOE4) correlates with a poor response to an Alzheimer's drug called tacrine. As yet another example, the drug
Herceptin™, a treatment for metastatic breast cancer, only works for patients whose tumors overproduce a certain
protein, called HER2. A screening test is given to all potential patients to weed out those on whom the drug won't be
effective.

[0013] In summary, it is well known that not all individuals respond identically to drugs for a given condition. Some
people respond well to drug A but poorly to drug B, some people respond better to drug B, while some have adverse
reactions to both drugs. In many cases it is currently difficult to tell how an individual person will respond to a given
drug, except by having them try using it.

[0014] It appears that a major reason people respond differently to a drug is that they have different forms of one or
more of the proteins that interact with the drug or that lie in the cascade initiated by taking the drug.
[0015] A common method for determining the genetic differences between individuals is to find Single Nucleotide
Polymorphisms (SNPs), which may be either in or near a gene on the chromosome, that differ between at least some
individuals in the population. A number of instances are known (Sickle Cell Anemia is a prototypical example) for which
the nucleotide at a SNP is correlated with an individual's propensity to develop a disease. Often these SNPs are linked
to the causative gene, but are not themselves causative. These are often called surrogate markers for the disease.
The SNP/surrogate marker approach suffers from at least three problems:

(1) Comprehensiveness: There are often several polymorphisms in any given gene. (See Ref. 10 for an example
in which there are 88 polymorphic sites). Most SNP projects look at a large number of SNPs, but spread over an
enormous region of the chromosome. Therefore the probability of finding all (or any) SNPs in the coding region of
a gene is small. The likelihood of finding the causative SNP(s) (the subset of polymorphisms responsible for causing
a particular condition or change in response to a treatment) is even lower.

(2) Lack of Linkage: If the causative SNP is in so-called linkage disequilibrium (Ref. 1, Chapter 2) with the measured
SNP, then the nucleotide at the measured SNP will be correlated with the nucleotide at the causative SNP. However,
it is impossible to predict a priori whether such linkage disequilibrium will exist for a particular pair of measured
and causative SNPs.

(3) Phasing: When there are multiple, interacting causative SNPs in a gene, one needs to know the
sequences of the two forms of the gene present in an individual. For instance, assume there is a gene that has 3
causative SNPs and that the remaining part of the gene is identical among all individuals. We can then identify the
two copies of the gene that any individual has with only the nucleotides at those sites. Now assume that 4 forms
exist in the population, labeled TAA, ATA, TTA and AAA. SNP methods effectively measure SNPs one at a time,
and leave the "phasing" between nucleotides at different positions ambiguous. An individual with one copy of TAA
and one of ATA would have a genotype (collection of SNPs) of [T/A, T/A, A/A]. This genotype is consistent with
the haplotypes TTA/AAA or TAA/ATA. An individual with one copy of TTA and one of AAA would have exactly the
same genotype as an individual with one copy of TAA and one copy of ATA. By using unphased genotypes, we
cannot distinguish these two individuals.
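
The phasing ambiguity in item (3) can be illustrated with a short sketch. The Python fragment below is illustrative only and is not part of the method described here; the function name `consistent_haplotype_pairs` and the representation of a genotype as a list of unordered allele pairs are assumptions made for this example. It enumerates every haplotype pair compatible with an unphased genotype, optionally restricted to the forms known to exist in the population.

```python
from itertools import product

def consistent_haplotype_pairs(genotype, known_haplotypes=None):
    """Return the haplotype pairs compatible with an unphased genotype.

    genotype: list of 2-tuples, the two alleles observed at each site
              (hypothetical representation chosen for this sketch).
    known_haplotypes: optional set of forms known to exist in the population.
    """
    pairs = set()
    # At each site, choose which of the two alleles goes on the first copy;
    # the other allele necessarily goes on the second copy.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        if known_haplotypes is not None and not {h1, h2} <= known_haplotypes:
            continue
        pairs.add(tuple(sorted((h1, h2))))  # the order of the two copies is irrelevant
    return sorted(pairs)

# The example from the text: genotype [T/A, T/A, A/A] and four known forms.
genotype = [("T", "A"), ("T", "A"), ("A", "A")]
known = {"TAA", "ATA", "TTA", "AAA"}
result = consistent_haplotype_pairs(genotype, known)
print(result)  # [('AAA', 'TTA'), ('ATA', 'TAA')]
```

Both pairs are returned for the same genotype, which is the ambiguity the text describes: the genotype alone does not say which pair the individual carries.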

[0016] A relatively low density SNP based map of the genome will have little likelihood of specifically identifying drug
target variations that will allow for distinguishing responders from poor responders, non-responders, or those likely to
suffer side-effects (or toxicity) to drugs. A relatively low density SNP based map of the genome also will have little
likelihood of providing information for new genetically based drug design. In contrast, using the data and analytical
tools of the present invention, knowing all the polymorphisms in the haplotypes will provide a firm basis for pursuing
pharmacogenetics of a drug or class of drugs.

[0017] With the present invention, by knowing which forms of the proteins an individual possesses, in particular, by
knowing that individual's haplotypes (which are the most detailed description of their genetic makeup for the genes of
interest) for rationally chosen drug target genes, or genes intimately involved with the pathway of interest, and by
knowing the typical response for people with those haplotypes, one can with confidence predict how that individual will
respond to a drug. Doing this has the practical benefit that the best available drug and/or dose for a patient can be
prescribed immediately rather than relying on a trial and error approach to find the optimal drug. The end result is a
reduction in cost to the health care system. Repeat visits to the physician's office are reduced, the prescription of
needless drugs is avoided, and the number of adverse reactions is decreased.

[0018] The Clinical Trials Solution (CTS™) method described herein provides a process for finding correlations
between haplotypes and response to treatment and for developing protocols to test patients and predict their response
to a particular treatment.

[0019] The CTS™ method is partially embodied in the DecoGen™ Platform, which is a computer program coupled
to a database used to display and analyze genetic and clinical information. It includes novel graphical and computational
methods for treating haplotypes, genotypes, and clinical data in a consistent and easy-to-interpret manner.

V. SUMMARY OF THE INVENTION 

[0020] The basis of the present invention is the fact that the specific form of a protein and the expression pattern of
that protein in a particular individual are directly and unambiguously coded for by the individual's isogenes, which can
be used to determine haplotypes. These haplotypes are more informative than the typically measured genotype, which
retains a level of ambiguity about which form of the proteins will be expressed in an individual. By having unambiguous
information about the forms of the protein causing the response to a treatment, one has the ability to accurately predict
individuals' responses to that treatment. Such information can be used to predict drug efficacy and toxic side effects,
lower the cost and risk of clinical trials, redefine and/or expand the markets for approved compounds (i.e., existing
drugs), revive abandoned drugs, and help design more effective medications by identifying haplotypes relevant to
optimal therapeutic responses. Such information can also be used, e.g., to determine the correct drug dose to give a
patient.

[0021] At the molecular level, there will be a direct correlation between the form and expression level of a protein
and its mode or degree of action. By combining this unambiguous molecular level information (i.e., the haplotypes)
with clinical outcomes (e.g., the response to a particular drug), one can find correlations between haplotypes and outcomes.
These correlations can then be used in a forward-looking mode to predict individuals' response to a drug.
[0022] The invention also relates to methods of making informative linkages between gene inheritance, disease
susceptibility and how organisms react to drugs.

[0023] The invention relates to methods and tools to individually design diagnostic tests and therapeutic strategies
for maintaining health, preventing disease, and improving treatment outcomes, in situations where subtle genetic differences
may contribute to disease risk and response to particular therapies.

[0024] The method and tools of the invention provide the ability to determine the frequency of each isogene, in
particular its haplotype, in the major ethno-geographic groups, as well as in disease populations.
[0025] Similarly, in agricultural biotechnology, the method and tools of the invention can be used to determine the
frequency of isogenes responsible for specific desirable traits, e.g., drought tolerance and/or improved crop yields,
and reduce the time and effort needed to transfer desirable traits.

[0026] The invention includes methods, computer program(s) and database(s) to analyze and make use of gene
haplotype information. These include methods, program, and database to find and measure the frequency of haplotypes
in the general population; methods, program, and database to find correlations between an individual's haplotypes or
genotypes and a clinical outcome; methods, program, and database to predict an individual's haplotypes from the
individual's genotype for a gene; and methods, program, and database to predict an individual's clinical response to a
treatment based on the individual's genotype or haplotype.

[0027] The invention also relates to methods of constructing a haplotype database for a population, comprising:

(a) identifying individuals to include in the population;

(b) determining haplotype data for each individual in the population from isogene information;

(c) organizing the haplotype data for the individuals in the population into fields; and

(d) storing the haplotype data for individuals in the population according to the fields.
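
Steps (c) and (d) can be pictured with a minimal sketch, offered here only as an illustration under stated assumptions. It uses Python's standard `sqlite3` module with an in-memory database; the table name `haplotype_data`, its field names, and the sample records are all hypothetical and are not taken from the database structures of the invention.

```python
import sqlite3

def build_haplotype_db(records):
    """Store one row per (individual, gene), organized into named fields.

    records: iterable of (individual_id, population_group, gene,
             haplotype_1, haplotype_2) tuples. The field layout is an
             assumption made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE haplotype_data (
               individual_id    TEXT,
               population_group TEXT,
               gene             TEXT,
               haplotype_1      TEXT,
               haplotype_2      TEXT,
               PRIMARY KEY (individual_id, gene))"""
    )
    conn.executemany("INSERT INTO haplotype_data VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn

def haplotype_counts(conn, gene):
    """Count how often each haplotype occurs for a gene (two copies per person)."""
    rows = conn.execute(
        """SELECT hap, COUNT(*) FROM (
               SELECT haplotype_1 AS hap FROM haplotype_data WHERE gene = ?
               UNION ALL
               SELECT haplotype_2 AS hap FROM haplotype_data WHERE gene = ?)
           GROUP BY hap ORDER BY hap""",
        (gene, gene),
    ).fetchall()
    return dict(rows)

# Hypothetical individuals and a hypothetical gene label.
sample = [
    ("P1", "GroupA", "GENE1", "TAA", "ATA"),
    ("P2", "GroupA", "GENE1", "TTA", "AAA"),
    ("P3", "GroupB", "GENE1", "TAA", "TAA"),
]
db = build_haplotype_db(sample)
counts = haplotype_counts(db, "GENE1")
print(counts)
```

Keeping the two copies in separate fields makes it straightforward to count haplotype occurrences per gene or per population group with ordinary queries.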

[0028] The invention also relates to methods of predicting the presence of a haplotype pair in an individual comprising,
in order:

(a) identifying a genotype for the individual;

(b) enumerating all possible haplotype pairs which are consistent with the genotype;

(c) accessing a database containing reference haplotype pair frequency data to determine a probability, for each
of the possible haplotype pairs, that the individual has a possible haplotype pair; and

(d) analyzing the determined probabilities to predict haplotype pairs for the individual.
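
Steps (a) through (d) can be sketched as follows. This is a simplified illustration, not the implementation of the invention: the reference database is stood in for by a plain dictionary of haplotype pair frequencies, the probabilities are obtained by simply normalizing those frequencies over the candidate pairs, and the names `enumerate_pairs` and `predict_haplotype_pairs` as well as the frequency values are hypothetical.

```python
from itertools import product

def enumerate_pairs(genotype):
    """Step (b): all haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def predict_haplotype_pairs(genotype, reference_pair_freq):
    """Steps (c) and (d): rank candidate pairs by normalized reference frequency.

    reference_pair_freq: dict mapping a sorted haplotype pair to its frequency
    in a reference population (a stand-in for the reference database).
    """
    candidates = enumerate_pairs(genotype)
    weights = {p: reference_pair_freq.get(p, 0.0) for p in candidates}
    total = sum(weights.values())
    if total == 0:
        # No reference data for any candidate: leave all probabilities at 0.0.
        return [(p, 0.0) for p in candidates]
    ranked = [(p, w / total) for p, w in weights.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Step (a): the genotype from the phasing example; frequencies are made up.
genotype = [("T", "A"), ("T", "A"), ("A", "A")]
reference = {("AAA", "TTA"): 0.03, ("ATA", "TAA"): 0.01}
ranking = predict_haplotype_pairs(genotype, reference)
for pair, prob in ranking:
    print(pair, round(prob, 2))
```

With these made-up frequencies the pair TTA/AAA is three times as likely as TAA/ATA, so it is ranked first, while the less likely pair is still reported with its probability.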

[0029] The invention also relates to methods for identifying a correlation between a haplotype pair and a clinical
response to a treatment comprising:

(a) accessing a database containing data on clinical responses to treatments exhibited by a clinical population;

(b) selecting a candidate locus hypothesized to be associated with the clinical response, the locus comprising at
least two polymorphic sites;

(c) generating haplotype data for each member of the clinical population, the haplotype data comprising information
on a plurality of polymorphic sites present in the candidate locus;

(d) storing the haplotype data; and

(e) identifying the correlation by analyzing the haplotype and clinical response data.

[0030] The invention also relates to methods for identifying a correlation between a haplotype pair and susceptibility
to a disease comprising the steps of:

(a) selecting a candidate locus hypothesized to be associated with the condition or disease, the locus comprising
at least two polymorphic sites;

(b) generating haplotype data for the candidate locus for each member of a disease population;

(c) organizing the haplotype data in a database;

(d) accessing a database containing reference haplotypes for the candidate locus;

(e) identifying the correlation by analyzing the disease haplotype data and the reference haplotype data wherein
when a haplotype pair has a higher frequency in the disease population than in the reference population, a correlation
of the haplotype pair to a susceptibility to the disease is identified.
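
The comparison in step (e) can be sketched in a few lines. The fragment below is illustrative only; the function names and the sample populations are hypothetical, and it applies just the criterion stated above (higher frequency in the disease population than in the reference population) without any test of statistical significance.

```python
from collections import Counter

def pair_frequencies(pairs):
    """Relative frequency of each (order-independent) haplotype pair."""
    counts = Counter(tuple(sorted(p)) for p in pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def enriched_pairs(disease_pairs, reference_pairs):
    """Pairs that are more frequent in the disease than in the reference population."""
    disease = pair_frequencies(disease_pairs)
    reference = pair_frequencies(reference_pairs)
    return sorted(p for p, f in disease.items() if f > reference.get(p, 0.0))

# Hypothetical populations of four individuals each.
disease_population = [("TAA", "ATA")] * 3 + [("TTA", "AAA")]
reference_population = [("TAA", "ATA")] + [("TTA", "AAA")] * 3
flagged = enriched_pairs(disease_population, reference_population)
print(flagged)  # [('ATA', 'TAA')]
```

In practice small populations would make such a raw comparison noisy, so a real analysis would presumably pair it with a significance test; that refinement is outside this sketch.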

[0031] The invention also relates to methods of predicting response to a treatment comprising:

(a) selecting at least one candidate gene which exhibits a correlation between haplotype content and at least two
different responses to the treatment;

(b) determining a haplotype pair of an individual for the candidate gene;

(c) comparing the individual's haplotype pair with stored information on the correlation; and

(d) predicting the individual's response as a result of the comparing.
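
Steps (c) and (d) amount to a lookup against previously stored correlation data. The sketch below assumes, purely for illustration, that the stored information is a table of observed response classes per haplotype pair and that the prediction is the most frequently observed class; the function names, response labels, and observations are hypothetical.

```python
from collections import Counter

def build_correlation_table(observations):
    """Tally observed response classes for each (order-independent) haplotype pair."""
    table = {}
    for pair, response in observations:
        table.setdefault(tuple(sorted(pair)), Counter())[response] += 1
    return table

def predict_response(haplotype_pair, table):
    """Return (most common response, its observed fraction) for the pair.

    Returns (None, 0.0) when no stored data exist for the pair.
    """
    counts = table.get(tuple(sorted(haplotype_pair)))
    if not counts:
        return None, 0.0
    response, n = counts.most_common(1)[0]
    return response, n / sum(counts.values())

# Hypothetical stored observations from an earlier clinical population.
observations = (
    [(("TAA", "ATA"), "responder")] * 3
    + [(("TAA", "ATA"), "non-responder")]
    + [(("TTA", "AAA"), "non-responder")] * 2
)
table = build_correlation_table(observations)
prediction = predict_response(("ATA", "TAA"), table)
print(prediction)  # ('responder', 0.75)
```

Returning the observed fraction alongside the predicted class gives a rough indication of how strongly the stored data support the prediction.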

[0032] The invention also provides computer systems which are programmed with program code which causes the
computer to carry out many of the methods of the invention. A range of computer types may be employed; suitable
computer systems include but are not limited to computers dedicated to the methods of the invention, and general-purpose
programmable computers. The invention further provides computer-usable media having computer-readable
program code stored thereon, for causing a computer to carry out many of the methods of the invention. Computer-usable
media include, but are not limited to, solid-state memory chips, magnetic tapes, or magnetic or optical disks.
The invention also provides database structures which are adapted for use with the computers, program code, and
methods of the invention.

VI. BRIEF DESCRIPTION OF THE DRAWINGS 

[0033]

FIGURE 1. System Architecture Schematic.

FIGURE 2. Pathway/Gene Collection View. This screen shows a schematic of candidate genes from which a
candidate gene may be selected to obtain further information. A menu on the left of the screen indicates some of
the information about the candidate genes which may be accessed from a database.



TNFR1 - Tissue Necrosis Factor 1
ADBR2 - Beta-2 Adrenergic Receptor
IGERA - immunoglobulin E receptor alpha chain
IGERB - immunoglobulin E receptor beta chain
OCIF - osteoclastogenesis inhibitory factor
ERA - Estrogen alpha receptor
IL-4R - interleukin 4 receptor
5HT1A - 5 hydroxytryptamine receptor 1A
DRD2 - dopamine receptor D2
TNFA - tumor necrosis factor alpha
IL-1B - interleukin 1B
PTGS2 - prostaglandin synthase 2 (COX-2)
IL-4 - interleukin 4
IL-13 - interleukin 13
CYP2D6 - cytochrome P450 2D6
HSERT - serotonin transporter
UCP3 - uncoupling protein 3



FIGURE 3. Gene Description View. This screen provides some of the basic information about the currently selected
gene.

FIGURE 4A. Gene Structure View. This screen shows the location of features in the gene (such as promoter,
introns, exons, etc.), the location of polymorphic sites in the gene for each haplotype and the number of times
each haplotype was seen in various world population groups.

FIGURE 4B. Gene Structure View (Cont.). This screen shows a screen which results after a gene feature is selected
in the screen of FIGURE 4A. An expanded view of the selected gene feature is shown at the bottom of the screen.

FIGURE 5. Sequence Alignment View. This screen shows an alignment of the full DNA sequences for all the
haplotypes (i.e., the isogenes) which appears in a separate window when one of the features in FIGURE 4A or
4B is selected. The polymorphic positions are highlighted.

FIGURE 6. mRNA Structure View. This screen shows the secondary structure of the RNA transcript for each
isogene of the selected gene.

FIGURE 7. Protein Structure View. This screen shows important motifs in the protein. The location of polymorphic
sites in the protein is indicated by triangles. Selecting a triangle brings up information about the selected polymorphism
at the top of the screen.

FIGURE 8. Population View. This screen shows information about each of the members of the population being
analyzed. PID is a unique identifier.

FIGURE 9. SNP Distribution View. This screen shows the genotype to haplotype resolution of each of the individuals
in the population being examined.

FIGURE 10. Haplotype Frequencies (Summary View). This screen shows a summary of ethnic distribution as a
function of haplotypes.

FIGURE 11. Haplotype Frequencies (Detailed View). This screen shows details of ethnic distribution as a function
of haplotype. Numerical data is provided.

FIGURE 12. Polymorphic Position Linkage View. This screen shows linkage between polymorphic sites in the
population.

FIGURE 13. Genotype Analysis View (Summary View). This screen shows haplotyping identification reliability
using genotyping at selected positions.

FIGURE 14. Genotype Analysis View (Detailed View). This screen gives a number value for the graphical data
presented in FIGURE 13.

FIGURE 15. Genotype Analysis View (Optimization View). This screen gives the results of a simple optimization
approach to finding the simplest genotyping approach for predicting an individual's haplotypes.

FIGURES 16 and 17. Haplotype Phylogenetic Views. These screens show minimal spanning networks for the
haplotypes seen in the population.

FIGURE 18. Clinical Measurements vs. Haplotype View (Summary). This screen shows a matrix summarizing the
correlation between clinical measurements and haplotypes.

FIGURE 19. Clinical Measurements vs. Haplotype View (Distribution View). This screen shows the distribution of
the patients in each cell of the matrix of FIGURE 18.

FIGURE 20. Expanded view of one haplotype-pair distribution. This screen results when a user selects a cell in
the matrix in FIGURE 19. The screen shows the number of patients in the various response bins indicated on the
horizontal axis.

FIGURE 21. Linear Regression Analysis View. This screen shows the results of a dose-response linear regression
calculation on each of the individual polymorphisms.

FIGURE 22. Clinical Measurements vs. Haplotype View (Details). This screen gives the mean and standard deviation
for each of the cells in FIGURE 18.

FIGURE 23. Clinical Measurement ANOVA calculation. This screen shows the statistical significance between
haplotype pair groups and clinical response.

FIGURE 24. Interface to the DecoGen CTS Modeler. As described in the text, a genetic algorithm (GA) is used to
find an optimal set of weights to fit a function of the subject haplotype data to the clinical response. The controls
at the right of the page are used to set the number of GA generations, the size of the population of "agents" that
coevolve during the GA simulation, and the GA mutation and crossover rates. The GA population and its
parameters should not be confused with those of the real human subjects. These are simply terms used in the
computational algorithm which is the GA. The GA is an error-minimizing approach, where the error is a weighted
sum of differences between the predicted clinical response and that which is measured. The graph in the top-middle
shows the residual error as a function of computational time, measured in generations. The bar graph at
the bottom center shows the weights from Equation 6 for the best solution found so far in the GA simulation.

FIGURE 25A. Gene Repository data submodel.

FIGURE 25B. Population Repository data submodel.

FIGURE 25C. Polymorphism Repository data submodel.

FIGURE 25D. Sequence Repository data submodel.

FIGURE 25E. Assay Repository data submodel.

FIGURE 25F. Legend of symbols in FIGURES 25A-E.

FIGURE 26. Pathway View. This screen shows a schematic of candidate genes relevant to asthma from which a
candidate gene may be selected to obtain further information. This view is an alternative way of showing information
similar to that described in the Pathway/Gene Collection View shown in FIGURE 2, with access to additional views,
projects and other information, as well as additional tools. A menu on the left of the screen in FIGURE 26 indicates
some of the information about the candidate genes which may be accessed from a database. The candidate
genes shown are:

ADBR2 - Beta-2 Adrenergic Receptor
IL-9 - Interleukin 9
PDE6B - Phosphodiesterase 6B
CALM1 - Calmodulin 1
JAK3 - Janus Tyrosine Kinase 3



The following is a description of what happens (or could be made to happen) when each of the items at the
top of the screens (e.g., "File", "Edit", "Subsets", "Action", "Tools", "Help") is selected:

• File:

New
Open

"File" lets the viewer open or save a project file, which contains a list of genes
to be viewed.

• Edit:

Cut
Copy
Paste

• Subsets:

"Subsets" allows the user to create and select for analysis subsets of the total patient set. Once a subset
has been defined and named, the name of the subset goes into the pulldown under this menu. Functions are
available to select a subset of patients based on clinical value ("Select everyone with a cholesterol level >
200"), or ethnicity, or genetic makeup ("Select all patients with haplotype CAGGCTGG for gene DAXX"), etc.

• Action:

Redo

"Redo" will cause displays to be regenerated when, for instance, the active set of SNPs has been
changed.

• Tools:

"Tools" will bring up various utilities, such as a statistics calculator, etc.

• Help:

"Help" will bring up on-line help for various functions.
The following is a description of the Standard Buttons that occur on all screens:

New (blank sheet) - standard windows button for creating new file - this creates a new project

Open (open folder) - standard windows button for opening existing file - open an existing project

Save (picture of floppy disk) - save the current project to a file

Save 2nd version - save the currently selected set of individuals or genes to a collection that can be separately
analyzed.

Print (picture of printer) - print the current page

Cut (scissors) - delete the selected items (could be a gene or genes, a person, a SNP, etc., depending on the
context)

Copy - copy the selected item (as above) to the clipboard

Paste - paste the contents of the clipboard to the current view

X - currently not used

New 2 (next blank page icon) - create a subset (genes, people, etc.) from the selected items in the view

Recalculate (icon of calculator) - redo computation of statistics, etc., depending on the context.

Help (question mark) - bring up on-line help for the current view.

The following is a description of Buttons that show up on several views:

Expand (magnifying glass with + sign) - zoom in on the graphical display - increase in size

Shrink (magnifying glass with - sign) - zoom out on the graphical display - decrease in size

FIGURE 27. GeneInfo View. This screen provides some of the basic information about the currently selected
ADRB2 gene. This screen is an alternative way of showing information similar to that described in the Gene
Description View in FIGURE 3.

FIGURE 28A. GeneStructure View. This screen shows the location of features in the gene (such as promoter,
introns, exons, etc.), the location of polymorphic sites in the gene for each haplotype and the number of times
each haplotype was seen in various world population groups for the ADRB2 gene. This screen is an alternative
way of showing information similar to that described in the Gene Structure View in FIGURE 4A.

FIGURE 28B. GeneStructure View (Cont.). This screen shows a screen which results after a gene feature is
selected in the screen of FIGURE 28A. This screen is an alternative way of showing information similar to that
described in the Gene Structure View in FIGURE 4B. An expanded view of the nucleotide sequence flanking the
selected polymorphic site is shown at the top of the screen. This portion of the screen provides access to some
of the same information as shown in FIGURE 5 (Sequence Alignment View).

FIGURE 29A. Patient Table View/Patient Cohort View. This screen shows genotype and haplotype information
about each of the members of the patient population being analyzed. Family relationships are also shown, when
such information is present. Families 1333 and 1047 shown in FIGURE 29A are the families that were analyzed
for this gene. In this particular screen, if other families had been analyzed, they would appear below those shown,
where one would scroll down. "Subject" is a unique identifier. The patients' genotypes are shown in the
top right panel. At the far left of this panel (not seen until one scrolls over) are the indices for the two haplotypes
that a patient has. These indices refer to the haplotype table at the bottom right. The left hand panel shows the
haplotype IDs for families that have been analyzed as part of a cohort. The haplotypes must follow a Mendelian
inheritance pattern, i.e., one copy from the individual's mother and one from the father. For instance, if an individual's
mother had haplotypes 1 and 2 and the father had haplotypes 3 and 4, then that individual must have one of the following
pairs: (1,3), (1,4), (2,3) or (2,4). This panel is used to check the accuracy of the haplotype determination method.

FIGURE 29B. Clinical Trial Data View. This screen gives the values of all of the clinical measurements for
each individual in FIGURE 29A.

FIGURE 30. HAPSNP View. This screen shows the genotype to haplotype resolution of the ADRB2 gene for each 
of the Individuals In the population being examined. This view provides similar information as that shown in the 
SNP Distribution View of FIGURE 9. 

FIGURE 31. HAPPair View. This screen shows a summary of the ethnic distribution of haplotypes of the ADRB2 gene.
This view is an alternative way of showing information similar to that shown in the Haplotype Frequencies (Summary
View) of FIGURE 10. The "V/D" (i.e., View Details) button in this view allows the user to toggle between the views
shown in FIGURES 31 and 32.

FIGURE 32. HAP Pair View (HAP Pair Frequency View). This screen shows details of ethnic distribution as a
function of haplotypes of the ADRB2 gene. Numerical data is provided. This view is an alternative way of showing
information similar to that shown in the Haplotype Frequencies (Detailed View) of FIGURE 11 for the CYP2D6
gene. The V/D button has the same function as in FIGURE 31.

FIGURE 33. Linkage View. This screen shows linkage between polymorphic sites in the population for the ADRB2
gene. This view is an alternative way of showing information similar to that shown in FIGURE 12 for the CYP2D6
gene.

FIGURE 34. HAPTyping View. This screen shows the reliability of haplotype identification using genotyping at
selected positions for the ADRB2 gene. This view is an alternative way of showing information similar to that shown
in the Genotype Analysis Views of FIGURES 13, 14 and 15 for the CYP2D6 gene. This view is the interface to the
automated method for determining the minimal number of SNPs that must be examined in order to determine the
haplotypes for a population. See "Step 6", Section D(1) and Example 2, herein, for details of this method. The view
shows all pairs of haplotypes, their corresponding genotypes, and finally the frequency of each genotype. The
inset (which one sees by scrolling to the right) shows the best scoring set of SNPs to score, along with a quality
score (scores < 1 are acceptable). The pairs of numbers in brackets are the genotypes that are still indistinguishable
given this SNP set. "Population" in the box at the top of the figure is equivalent to the "Subset" selection menu
described above. Populations and subsets are the same. One subset is the total analyzed population.

FIGURE 35. Phylogenetic View. These screens show minimal spanning networks for the haplotypes seen in the
population for the ADRB2 gene. This view is an alternative way of showing information similar to that shown in
FIGURES 16 and 17 for the CYP2D6 gene. This view also provides a window containing haplotype and ethnic
distribution information. The numbers next to the balls represent the haplotype number and the numbers inside
the parentheses represent the number of people in the analyzed population that have that haplotype. The function
of the calculator button (or a red/green flag button, not shown in this view) is the same as recalculate in FIGURES
16 and 17. In this case it arranges nodes according to evolutionary distance.

FIGURE 36. Clinical Haplotype Correlations View (Summary). This screen shows a matrix summarizing the
correlation between clinical measurements and haplotypes for the ADRB2 gene. This view is an alternative way of
showing information similar to that shown in FIGURE 18 for the CYP2D6 gene.
Buttons are as described for FIGURES 26 and as follows:

• Graph (icon of graph) - does a statistics calculation and brings up a statistics results window, such as FIGURE
39A.

• Normal (icon of bell curve) - does a HAPpair ANOVA calculation, a specialized statistical calculation.

• 3 finger down icon - displays a graph showing a histogram of clinical data for individuals with specific genetic
markers.

• Thermometer - shows a list of clinical variables for the user to select from for display and analysis.

Some of the viewing modes obtainable by selecting the following drop-down menus on this view (and the
other views on which they appear) are:

Linear
Log
Log 10

Summary
Distribution
Details
Quantile

Regression
ANOVA
Case Control
ANCOVA
Response Model

FIGURE 37. Clinical Measurements vs. Haplotype View (Distribution View). This screen shows the distribution of
the patients in each cell of the matrix of FIGURE 36. This view is an alternative way of showing information similar
to that shown in FIGURE 19 for the CYP2D6 gene. Drop-down menus and buttons are as described for FIGURE 36.

FIGURE 38. Expanded Clinical Distribution View. This screen shows an expanded view of one haplotype-pair
distribution. This screen results when a user selects a cell in the matrix in FIGURE 37. The screen shows the
number of patients in the various response bins indicated on the horizontal axis. This view is an alternative way
of showing information similar to that shown in FIGURE 20 for the CYP2D6 gene, and also displays additional
information.

FIGURE 39A. DecoGen Single Gene Statistics Calculator (Linear Regression Analysis View). This screen shows
the results of a dose-response linear regression calculation on each of the shown individual polymorphisms or
subhaplotypes with respect to the clinical measure "Delta % FEV1 pred." The SNPs and subhaplotypes shown
are those selected as significant in the build-up procedure described below. This view is an alternative way of
showing information similar to that shown in FIGURE 21 for the CYP2D6 gene and the "test" measurement, with
additional information. The numbers in the boxes next to "Confidence" and "Fixed Site" in FIGURE 39A are default
values for these parameters, but can be changed by the user. After they are changed, the user must click the
"Redo" or "Recalculate" button (the little calculator icon) to regenerate the statistics with the new parameters. The
first two boxes hold the tight and loose cutoffs for the SNP-to-HAP build-up procedure we have already discussed.
The "Fixed Site" value says how far the build-up can proceed; a value of "4" says to produce subhaplotypes with no
more than 4 non-* sites. The minus sign says to also do the full-haplotype build-down procedure. Selecting the
Show/Hide button allows the user to toggle between modes where all examined correlations are displayed and
where only those passing the tight statistical criteria are displayed.

FIGURE 39B. Regression for Delta %FEV1 Pred. View. This view shows the regression line for the response as a
function of the number of copies of haplotype **a*****A*G**.

FIGURE 40. Clinical Measurements vs. Haplotype View (Details). This screen gives the mean and standard
deviation for each of the cells in FIGURE 36. This view is an alternative way of showing some of the information
similar to that shown in FIGURE 22 for the CYP2D6 gene and the "test" measurement.

FIGURE 41. Clinical Measurement ANOVA calculation. This screen shows the statistical significance between
haplotype pair groups and clinical response for the HAP pairs for the ADRB2 gene. This view is an alternative way
of showing some of the information similar to that shown in FIGURE 23 for the CYP2D6 gene and the "test"
measurement.

FIGURE 42. Clinical Variables View. This figure simply shows histogram distributions for each of the clinical
variables. This is the same as FIGURE 38, but not selected by haplotype pair. A clinical measurement is chosen by
selecting one of the lines in the top list.

FIGURE 43. Clinical Correlations View. This view allows one to see the correlation between any pair of clinical
measurements. The user selects one measurement from the list on the left, which becomes the x-axis, and one
from the list on the right, which becomes the y-axis. Each point on the bottom graph represents one individual in
the clinical cohort.

FIGURE 44A. Genomic Repository data submodel. This is a preferred alternative model to the submodels shown 
in FIGURES 25A and 25D. 

FIGURE 44B. Clinical Repository data submodel. This is a preferred alternative submodel to that shown in FIGURE
25B.

FIGURE 44C. Variation Repository data submodel. This is an alternative submodel to that shown in FIGURE 25C.

FIGURE 44D. Literature Repository data submodel. This incorporates some of the tables from the gene repository 
submodel shown in FIGURE 25A. 

FIGURE 44E. Drug Repository data submodel. This is an alternative submodel to that shown in FIGURE 25E.

FIGURE 44F. Legend of symbols in FIGURES 44A-E.

FIGURE 45. Flow Chart. This is a flow chart for a multi-SNP analysis method of associating phenotypes (such as
clinical outcomes) with haplotypes (also called a "build-up" procedure).

FIGURE 46. Flow Chart. This is a flow chart for a reverse-SNP analysis method of associating phenotypes (such
as clinical outcomes) with haplotypes (also called a "pare-down" procedure).

FIGURE 47. Diagram of a process for assembling a genomic sequence by a human or a computer. 
FIGURE 48. Diagram of a process for generating and displaying a gene structure. 
FIGURE 49. Diagram of a process of generating and displaying a protein structure. 
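
The Mendelian consistency check described for FIGURE 29A can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the DecoGen implementation: the function names `possible_child_pairs` and `is_mendelian_consistent` are hypothetical, and haplotypes are represented simply by their integer indices in the haplotype table.

```python
from itertools import product


def possible_child_pairs(mother, father):
    """Return every haplotype pair a child could inherit.

    `mother` and `father` are 2-tuples of haplotype indices (hypothetical
    representation). The child receives one haplotype from each parent;
    pairs are stored sorted so that (3, 1) and (1, 3) compare equal.
    """
    return {tuple(sorted(pair)) for pair in product(mother, father)}


def is_mendelian_consistent(child, mother, father):
    """True if the child's haplotype pair could come from these parents."""
    return tuple(sorted(child)) in possible_child_pairs(mother, father)


# The example from the text: mother (1, 2), father (3, 4).
pairs = possible_child_pairs((1, 2), (3, 4))
print(sorted(pairs))  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

A haplotype assignment that fails this check for any child in a family suggests an error in the haplotype determination for at least one family member.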

VII. DETAILED DESCRIPTION OF THE INVENTION

A. DEFINITIONS 

[0034] The following definitions are used herein: 


Allele - A particular form of a genetic locus, distinguished from other forms by its particular nucleotide sequence.

Ambiguous polymorphic site - A heterozygous polymorphic site or a polymorphic site for which nucleotide
sequence information is lacking.

Candidate Gene - A gene which is hypothesized or known to be responsible for a disease, condition, or the
response to a treatment, or to be correlated with one of these.

Full Polymorphic Set - The polymorphic set whose members are a sequence of all the known polymorphisms.

Full-genotype - The unphased 5' to 3' sequence of nucleotide pairs found at all known polymorphic sites in a
locus on a pair of homologous chromosomes in a single individual.

Gene - A segment of DNA that contains all the information for the regulated biosynthesis of an RNA product,
including promoters, exons, introns, and other untranslated regions that control expression.

Gene Feature - A portion of the gene such as, e.g., a single exon, a single intron, or a particular region of the 5' or
3'-untranslated regions. The gene feature is always associated with a continuous DNA sequence.

Genotype - An unphased 5' to 3' sequence of nucleotide pair(s) found at one or more polymorphic sites in a locus
on a pair of homologous chromosomes in an individual. As used herein, genotype includes a full-genotype and/or
a sub-genotype as described below.

Genotyping - A process for determining a genotype of an individual.

Haplotype - A member of a polymorphic set, e.g., a sequence of nucleotides found at one or more of the
polymorphic sites in a locus in a single chromosome of an individual. (See, e.g., HAP 1 in FIGURE 4A). A full haplotype
is a member of a full polymorphic set. A sub-haplotype is a member of a polymorphic subset.
Haplotype data - Information concerning one or more of the following for a specific gene: a listing of the haplotype
pairs in each individual in a population; a listing of the different haplotypes in a population; frequency of each
haplotype in that or other populations, and any known associations between one or more haplotypes and a trait.

Haplotype pair - The two haplotypes found for a locus in a single individual.

Haplotyping - A process for determining one or more haplotypes in an individual and includes use of family
pedigrees, molecular techniques and/or statistical inference.

Isoform - A particular form of a gene, mRNA, cDNA or the protein encoded thereby, distinguished from other forms
by its particular sequence and/or structure.

Isogene - One of the two copies (or isoforms) of a gene possessed by an individual or one of all the copies (or
isoforms) of the gene found in a population. An isogene contains all of the polymorphisms present in the particular
copy (or isoforms) of the gene.

Isolated - As applied to a biological molecule such as RNA, DNA, oligonucleotide, or protein, isolated means the
molecule is substantially free of other biological molecules such as nucleic acids, proteins, lipids, carbohydrates,
or other material such as cellular debris and growth media. Generally, the term "isolated" is not intended to refer
to a complete absence of such material or to absence of water, buffers, or salts, unless they are present in amounts
that substantially interfere with the methods of the present invention.

Locus - A location on a chromosome or DNA molecule corresponding to a gene or a physical or phenotypic feature.

Nucleotide pair - The nucleotides found at a polymorphic site on the two copies of a chromosome from an
individual.

Phased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, phased means
the combination of nucleotides present at those polymorphic sites on a single copy of the locus is known.

Polymorphic Set - A set whose members are a sequence of one or more polymorphisms found in a locus on a
single chromosome of an individual. See, e.g., the set having members HAP 1 through HAP 10 in FIGURE 4A.

Polymorphic site - A nucleotide position within a locus at which the nucleotide sequence varies from a reference
sequence in at least one individual in a population. Sequence variations can be substitutions, insertions or deletions
of one or more bases.

Polymorphic Subset - The polymorphic set whose members are fewer than all the known polymorphisms.

Polymorphism - The sequence variation observed in an individual at a polymorphic site. Polymorphisms include
nucleotide substitutions, insertions, deletions and microsatellites and may, but need not, result in detectable
differences in gene expression or protein function.

Polymorphism data - Information concerning one or more of the following for a specific gene: location of
polymorphic sites; sequence variation at those sites; frequency of polymorphisms in one or more populations; the
different genotypes and/or haplotypes determined for the gene; frequency of one or more of these genotypes and/or
haplotypes in one or more populations; any known association(s) between a trait and a genotype or a haplotype
for the gene.

Polymorphism Database - A collection of polymorphism data arranged in a systematic or methodical way and
capable of being individually accessed by electronic or other means.

Polynucleotide - A nucleic acid molecule comprised of single-stranded RNA or DNA or comprised of
complementary, double-stranded DNA.

Reference Population - A group of subjects or individuals who are representative of a general population and
who contain most of the genetic variation predicted to be seen in a more specialized population. Typically, as used
in the present invention, the reference population represents the genetic variation in the population at a certainty
level of at least 85%, preferably at least 90%, more preferably at least 95% and even more preferably at least 99%.

Reference Repository - A collection of cells, tissue or DNA samples from the individuals in the reference
population.

Single Nucleotide Polymorphism (SNP) - A polymorphism in which a single nucleotide observed in a reference
individual is replaced by a different single nucleotide in another individual.

Sub-genotype - The unphased 5' to 3' sequence of nucleotides seen at a subset of the known polymorphic sites
in a locus on a pair of homologous chromosomes in a single individual.

Subject - An individual (person, animal, plant or other eukaryote) whose genotype(s) or haplotype(s) or response
to treatment or disease state are to be determined.

Treatment - A stimulus administered internally or externally to an individual.

Unphased - As applied to a sequence of nucleotide pairs for two or more polymorphic sites in a locus, unphased
means the combination of nucleotides present at those polymorphic sites on a single copy of the locus (i.e., located
on a single DNA strand) is not known.

World Population Group - Individuals who share a common ethnic or geographic origin.
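
The relationship between the phased and unphased definitions above can be illustrated with a small sketch. It is an illustration only: haplotypes are represented as strings with one nucleotide per polymorphic site, and the function names `genotype_of` and `compatible_pairs` are hypothetical. The sketch shows how an unphased genotype is derived from a haplotype pair, and why a genotype with two or more heterozygous (ambiguous) sites can be consistent with more than one haplotype pair.

```python
from itertools import combinations_with_replacement


def genotype_of(hap_a, hap_b):
    """Unphased genotype of a haplotype pair.

    At each polymorphic site the two nucleotides are stored as a sorted
    (unordered) pair, so the information about which chromosome copy
    carried which nucleotide is discarded.
    """
    return tuple(tuple(sorted(pair)) for pair in zip(hap_a, hap_b))


def compatible_pairs(genotype, known_haplotypes):
    """All pairs of known haplotypes that would produce this genotype."""
    return [
        (a, b)
        for a, b in combinations_with_replacement(known_haplotypes, 2)
        if genotype_of(a, b) == genotype
    ]


haplotypes = ["AC", "AT", "GC", "GT"]  # hypothetical two-site haplotypes

# Heterozygous at both sites: the phase is ambiguous.
g = genotype_of("AC", "GT")
print(g)                                # (('A', 'G'), ('C', 'T'))
print(compatible_pairs(g, haplotypes))  # [('AC', 'GT'), ('AT', 'GC')]

# Heterozygous at only one site: a single pair is possible.
print(compatible_pairs(genotype_of("AC", "GC"), haplotypes))  # [('AC', 'GC')]
```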

B. METHODS OF IMPLEMENTING THE INVENTION

[0035] The present invention may be implemented with a computer, an example of which is shown in FIGURE 1A.
The computer includes a central processing unit (CPU) connected by a system bus or other connecting means to a
communication interface, system memory (RAM), non-volatile memory (ROM), and one or more other storage devices
such as a hard disk drive, a diskette drive, and a CD ROM drive. The computer may also include an internal or external
modem (not shown). The computer also includes a display device, such as a CRT monitor or an LCD display, and an
input device, such as a keyboard, mouse, pen, touch-screen, or voice activation system. The computer stores and
executes various programs such as an operating system and application programs. The computer may be embodied,
for example, as a personal computer, work station, laptop, mainframe, or a personal digital assistant. The computer
may also be embodied as a distributed multi-processor system or as a networked system such as a LAN having a
server and client terminals.

[0036] The present invention uses a program, referred to as the "DecoGen™ application", that generates views (or
screens) displayed on a display device and with which the user can interact to accomplish a variety of tasks and
analyses. For example, the DecoGen™ application may allow users to view and analyze large amounts of information
such as gene-related data (e.g., gene loci, gene structure, gene family), population data (e.g., ethnic, geographical,
and haplotype data for various populations), polymorphism data, genetic sequence data, and assay data. The
DecoGen™ application is preferably written in the Java programming language. However, the application may be written
using any conventional visual programming language such as C, C++, Visual Basic or Visual Pascal. The DecoGen™
application may be stored and executed on the computer. It may also be stored and executed in a distributed manner.
[0037] The data processed by the DecoGen™ application is preferably stored as part of a relational database (e.g.,
an instance of an Oracle database or a set of ASCII flat files). This data can be stored on, for example, a CD ROM or
on one or more storage devices accessible by the computer. The data may be stored on one or more databases in
communication with the computer via a network.

[0038] In one scenario, the data will be delivered to the user on any standard media (e.g., CD, floppy disk, tape) or
can be downloaded over the internet. The DecoGen™ application and data may also be installed on a local machine.
The DecoGen™ application and data will then be on the machine that the user directly accesses. Data can be
transmitted in the form of signals.

[0039] FIGURE 1B shows an implementation where a network interconnects one or more host computers with one
or more user terminals. The communication network may, for example, include one or more local area networks (LANs),
metropolitan area networks (MANs), wide area networks (WANs), or a collection of interconnected networks such as
the internet. The network may be wired, wireless, or some combination thereof. The host computer may, for example,
be a world wide web server ("web server"). The user terminal may, for example, be a client device such as a computer
as shown in FIGURE 1A.

[0040] A web server stores information documents called pages. A server process listens for incoming connections
from clients (e.g., browsers running on a client device). When a connection is established, the client sends a request
and the server sends a reply. The request typically identifies a page by its Uniform Resource Locator (URL) and the
reply includes the requested page. This client-server protocol is typically performed using the hypertext transfer protocol
("http"). Pages are viewed using a browser program. They are written in a language called hypertext markup language
("html"). A typical page includes text and formatting commands called tags. Pages may also include links (pointers) to
other pages. Strings of text or images that are links to other pages are called hyperlinks. Hyperlinks are highlighted
(e.g., by shading, color, underlining) and may be invoked by placing the cursor on the highlighted area and selecting
it (e.g., by clicking the mouse button). A page may also contain a URL reference to a portion of multimedia data such
as an image, video segment, or audio file. Pages may also point to a Java program called an applet. When the browser
connects to where the applet is stored, the applet is downloaded to the client device and executed there in a secure
manner. Pages may also contain forms that prompt a user to enter information or that have active maps. Data entered
by a user may be handled by common gateway interface (CGI) programs. Such programs may, for example, provide
web users with access to one or more databases.

[0041] As shown in FIGURE 1B, the host computer may include a CPU connected by a system bus or other connecting
means to a communication interface, system memory (RAM), non-volatile memory (ROM), and a mass storage device. The
mass storage device may, for example, be a collection of magnetic disk drives in a RAID system. The mass storage
device may, for example, store the aforementioned web pages, applets, and the like. The host computer may also
include an input device, such as a keyboard, and a display device to allow for control and management by an
administrator. Additionally, the host computer may be connected to additional devices such as printers, auxiliary monitors or
other input/output devices. The input device and display device may also be provided on another computer coupled
to the host computer. The host computer may be embodied, for example, as one or more mainframes, workstations,
personal computers, or other specialized hardware platforms. The functionality of the host computer may be centralized
or may be implemented as a distributed system. As also shown in FIGURE 1B, the host computer may communicate
with one or more databases stored on any of a variety of hardware platforms.

[0042] In an internet scenario, for example involving the system of FIGURE 1B, the DecoGen™ application will be
web-based and will be delivered as an applet that runs in a web browser. In this case, the data will reside on a server
machine and will be delivered to the DecoGen™ application using a standard protocol (e.g., HTTP with cgi-bin). To
provide extra security, the network connection could use a dedicated line. Furthermore, the network connection could
use a secure protocol such as Secure Sockets Layer (SSL), with access to the server provided only from a specified
set of IP addresses.

[0043] In another scenario, the DecoGen™ application can be installed on a user machine and the data can reside
on a separate server machine. Communication between the two machines can be handled using standard client-server
technology. An example would be to use the TCP/IP protocol to communicate between the client and an Oracle server.
[0044] It may be noted that in any of the prior scenarios, some or all of the data used by the DecoGen™ application
could be directly imported into the DecoGen™ application by the user. This import could be carried out by reading files
residing on the user's local machine, or by cutting and pasting from a user document into the interface of the DecoGen™
application. In yet a further scenario, some or all of the data or the results of analyses of the data could be exported
from the DecoGen™ application to the user's local computer. This export could be carried out by saving a file to the
local disk or by cutting and pasting to a user document.

[0045] In the present invention various calculations are performed to generate items displayed on a screen or to
control items displayed on a screen. As is well known, some basic calculations may be performed using a database
query language (SQL), while other computations are performed by the DecoGen™ application (i.e., the Java program
which, as previously mentioned, may be an applet downloaded over the internet).
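
The division of labor described above, with basic calculations done in SQL and further computation done by the application, might look like the following minimal sketch. Everything in it is an assumption made for illustration: the table name `subject_haplotype`, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the relational database (the text prefers Oracle and a Java application).

```python
import sqlite3

# Hypothetical schema: one row per haplotype copy carried by a subject.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject_haplotype ("
    "subject_id TEXT, population TEXT, hap_id INTEGER)"
)
rows = [
    ("S1", "CA", 1), ("S1", "CA", 2),
    ("S2", "CA", 1), ("S2", "CA", 1),
    ("S3", "AA", 2), ("S3", "AA", 3),
]
conn.executemany("INSERT INTO subject_haplotype VALUES (?, ?, ?)", rows)

# Basic calculation in SQL: count haplotype copies per population group.
counts = conn.execute(
    "SELECT population, hap_id, COUNT(*) "
    "FROM subject_haplotype "
    "GROUP BY population, hap_id "
    "ORDER BY population, hap_id"
).fetchall()

# Further computation in the application: convert counts to frequencies.
totals = {}
for population, _hap_id, n in counts:
    totals[population] = totals.get(population, 0) + n
frequencies = {
    (population, hap_id): n / totals[population]
    for population, hap_id, n in counts
}

print(counts)       # [('AA', 2, 1), ('AA', 3, 1), ('CA', 1, 3), ('CA', 2, 1)]
print(frequencies)  # e.g. ('CA', 1) -> 0.75
```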

C. CTS™ METHODS OF THE INVENTION

[0046] The CTS™ embodiment of the present invention preferably includes the following steps:

1. A candidate gene or genes (or other loci) predicted to be involved in a particular disease/condition/drug response
is determined or chosen.

2. A reference population of healthy individuals with a broad and representative genetic background is defined.

3. For each member of the reference population, DNA is obtained.

4. For each member of the reference population, the haplotypes for each of the candidate gene(s) (or other loci)
are determined.

5. Population averages and statistics for each of the gene(s) (loci)/haplotypes in the reference population are
determined.

6. (Optional step) An optimal set of genotyping markers is determined. These markers allow an individual's
haplotypes to be accurately predicted without using direct molecular haplotype analysis. The predictive haplotyping
method relies on the haplotype distribution found for the reference population.

7. A trial population of individuals with the medical condition of interest is recruited.

8. Individuals in the trial population are treated using some protocol and their response is measured. They are
also haplotyped for each of the candidate gene(s), either directly or using predictive haplotyping based on the
genotype.

9. Correlations between individual response and haplotype content are created for the candidate gene(s) (or other
loci). From these correlations, a mathematical model is constructed that predicts response as a function of
haplotype content.

10. (Optional) Follow-up trials are designed to test and validate the haplotype-response mathematical model.

11. (Optional) A diagnostic method is designed (using haplotyping, genotyping, physical exam, serum test, etc.)
to determine those individuals who will or will not respond to the treatment.
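
As an illustration of step 9, the sketch below fits a straight line predicting response from the number of copies (0, 1 or 2) of one haplotype that each subject carries. This is only one possible form of the "mathematical model"; ordinary least squares is swapped in here as an assumption, and the function names and the trial data are hypothetical (the responses were constructed to lie exactly on a line so the result is easy to check).

```python
def copy_number(hap_pair, target):
    """Number of copies (0, 1 or 2) of haplotype `target` in a pair."""
    return sum(1 for hap in hap_pair if hap == target)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all subjects have the same copy number")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical trial data: (haplotype pair, measured response).
subjects = [
    ((1, 1), 2.0),
    ((1, 2), 5.0),
    ((2, 2), 8.0),
    ((3, 2), 5.0),
    ((1, 3), 2.0),
]

xs = [copy_number(pair, target=2) for pair, _ in subjects]
ys = [response for _, response in subjects]
slope, intercept = fit_line(xs, ys)

# Predicted response for a subject carrying two copies of haplotype 2.
predicted = intercept + slope * 2
print(round(slope, 6), round(intercept, 6), round(predicted, 6))  # 3.0 2.0 8.0
```

A model of this kind treats the haplotype effect as additive in copy number; other model forms (for example, one term per haplotype pair) would be fitted differently.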

[0047] These steps are now described in further detail below:

[0048] 1. A candidate gene or genes (or other loci) for the disease/condition is determined.

[0049] In the CTS embodiment of the invention, candidate gene(s) (or other loci) are a subset of all genes (or other
loci) that have a high probability of being associated with the disease of interest, or are known or suspected of interacting
with the drug being investigated. Interacting can mean binding to the drug during its normal route of action, binding to
the drug or one of its metabolic products in a secondary pathway, or modifying the drug in a metabolic process.
Candidate genes can also code for proteins that are never in direct contact with the drug, but whose environment is affected
by the presence of the drug. In other embodiments of the invention, candidate gene(s) (or other loci) may be those
associated with some other trait, e.g., a desirable phenotypic trait. Such gene(s) (or other loci) may be, e.g., obtained
from a human, plant, animal or other eukaryote. Candidate genes are identified by references to the literature or to
databases, or by performing direct experiments. Such experiments include (1) measuring expression differences that
result from treating model organisms, tissue cultures, or people with the drug; or (2) performing protein-protein binding
experiments (e.g., antibody binding assays, yeast 2 hybrid assays, phage display assays) using known candidate
proteins to identify interacting proteins whose corresponding nucleotide (genomic or cDNA) sequence can be
determined.

[0050] Once the candidate gene(s) (or other loci) are identified, information about them is stored in a database. This
information includes, for example, the gene name, genomic DNA sequence, intron-exon boundaries, protein sequence
and structure, expression profiles, interacting proteins, protein function, and known polymorphisms in the coding and
non-coding regions, to the extent known or of interest. This information can come from public sources (e.g., GenBank,
OMIM (Online Mendelian Inheritance in Man, a database of polymorphisms linked to inherited diseases), etc.). For genes
that are not fully characterized, this step would generally require that the characterization be done. However, this is
possible using standard mapping, cloning and sequencing techniques. The minimum amount of information needed is the
nucleotide sequence for important regions of the gene. Genomic DNA or cDNA sequences are preferably used.

[0051] In the present invention, a person may use a user terminal to view a screen which allows the user to see all
of the candidate genes associated with the disease project and to bring up further information. This screen (as well as
all the other screens described herein) may, for example, be presented as a web page, or a series of web pages, from
a web server. This web based use may involve a dedicated phone line, if desired. Alternatively, this screen may be
served over the network from a non-web based server or may simply be generated within the user terminal. An example
of such a screen, referred to herein as a "Pathways" or "Gene Collection" screen, is illustrated in FIGURE 2.

1. Illustration Using The CYP2D6 Gene

[0052] FIGURE 2 is an example of a screen showing the set of candidate genes whose polymorphisms potentially
contribute to the response to a drug or to some other phenotype. The screen shows genes for which data is currently
available in a database useful in the invention in green; those queued for processing (and for which data will appear
in a database) would appear in one shade or color, e.g., yellow, and related but unqueued genes (those for which there
is currently no plan to deposit data in a database) would appear in another shade or color, e.g., white. Drugs (typically
ones that interact with one or more of the genes of interest) would be shown in a third shade or color, e.g., light blue.
The user can select a gene to examine in detail by using the mouse (or other user-input device such as keyboard,
roller ball, voice recognition, etc.) to select the corresponding icon. In the example depicted in FIGURE 2, CYP2D6, a
cytochrome P450 enzyme, is selected, as indicated by the extra black box around the CYP2D6 icon. At the left of
each screen is a menu that allows the user to navigate through different screens of the data.
[0053] A preferred embodiment of the present invention relates to situations in which patients have differential
responses to the drug because they possess different forms of one or more of the candidate genes (or other loci). (Here
different forms of the candidate gene(s) mean that the patients have different genomic DNA sequences in the gene
locus). The method does not rely on these differences being manifested in altered amino acids in any of the proteins
expressed by any candidate gene(s) (e.g., it includes polymorphisms that may affect the efficiency of expression or
splicing of the corresponding mRNA). All that is required is that there is a correlation between having a particular
form(s) of one or more of the genes and a phenotypic trait (e.g., response to a drug). Examples of salient information about
the candidate genes are given in FIGURES 3-8.

[0054] FIGURE 3 is an example of a screen showing basic information about the currently selected gene such as
its name, definition, function, organism, and length. These pieces of information typically come from GenBank or other
public data sources. The figure will typically also show the number of "gene features" (e.g., exons, introns, promoters,
3' untranslated regions, 5' untranslated regions, etc.) in the database, the size of the analyzed population (group of
people whose DNA has been examined for this gene), the number of haplotypes found for this gene in this population,
and some measures of polymorphism frequency. The information is stored in a database such as the one described
herein, or calculated from information stored in such a database. Most of the information shown in later figures is
specific to this analyzed population. Theta and Pi are standard measures of polymorphism frequency, described in
Ref. 1, Chapter 2.

[0055] FIGURES 4A and 4B are examples of screens showing the genomic structure of the gene (generally showing
the location of features of the gene, such as promoters, exons, introns, 5' and 3' untranslated regions), as well as
haplotype information. FIGURE 4A shows the location of the features in the gene, the location of the polymorphic sites
along the gene, the nucleotides at the polymorphic sites for each of the haplotypes, and the number of times each
haplotype was seen in the representatives of each of 4 world population groups (CA = Caucasian, AA = African American,
HL = Hispanic/Latino, AS = Asian) included in the population analyzed for this gene. All of this data resides in a database
or is calculated from the data in a database. The top view shows the nucleotides at the polymorphic sites, i.e., the
haplotypes. The middle cartoon shows the features of the gene. In this example the promoter is indicated by a dark
shaded (or red) rectangular box and a line with an arrow, exons are shown by a gray shaded (or blue) rectangular box
and introns are shown in white (or in yellow). When the mouse is held over a feature, the feature turns red and the
name of the feature appears (e.g., in this case, Gene). The code in parentheses (M22245) is the GenBank accession
number for the selected feature. FIGURE 4B is the same screen as FIGURE 4A, after the user selects the gene feature.
Under the cartoon of the features are vertical bars indicating the positions of the polymorphic sites, with one row per
unique haplotype. The letter "d" indicates that there is a deletion. The table at the left gives the number of haplotype
copies seen in each of the standard populations. For instance, this screen indicates that there are 10 copies of haplotype
10 in Caucasians, 2 copies in African Americans, and none in Hispanic/Latinos or Asians, for a total of 12 copies. Note
that the total number of haplotypes is twice the number of individuals examined. At the very bottom is an expanded
cartoon of the feature. One may display data concerning a particular polymorphism by selecting the corresponding
vertical bar on the expanded cartoon. The selected bar may be identified, e.g., by a shaded or colored circle. The data
for the polymorphism appears at the lower left of the screen. This gives the number of copies of each nucleotide (A,
C, G or T) seen in each of the world population groups.

[0056] FIGURE 5 is an example of a screen showing the actual DNA sequence of the genomic locus for the different
haplotypes seen in the population (i.e., the sequence of the isogenes). This view appears in a separate window when
one of the features in the Gene Structure Screen (FIGURE 4A or 4B) is selected with the mouse or other input device.
This shows an alignment between the full DNA sequences for all of the isogenes of the CYP2D6 gene in the database.
The polymorphic positions are highlighted.

[0057] FIGURE 6 is an example of a screen showing the predicted secondary structure of the mRNA transcript for
each CYP2D6 isogene in the database. The secondary structure is predicted using a detailed thermodynamic model
as implemented in the program RNA structure (REF. 2). This is useful because many of the polymorphisms detected
do not change the amino acid composition of the resulting protein but still lie in the coding region of the gene. One
result of such a silent mutation could be to alter the intermediate mRNA's structure in a way that could affect mRNA
stability, or how (and if) the mRNA was spliced, transcribed or processed by the ribosome. Such a polymorphism could
keep any of the protein from being expressed and from being available to carry out its functions. In this screen, the
user can see thumbnail views of the structures for all of the isogenes and can see a selected one of these structures
expanded on the right hand side of the screen. Changes in this structure caused by the polymorphisms seen in the
isogenes can affect the expression into protein of the gene. The information presented in this screen can serve as an
aid to the user to detect possible effects of these polymorphisms.

[0058] FIGURE 7 is an example of a screen showing a schematic of the structure of the protein expressed by the
gene, including important domains and the sites of the coding polymorphisms. The user gets to this screen by selecting
the "Protein Structure" link at the left hand side of the display. This screen shows various important motifs found in the
protein, and places the polymorphic sites in the context of these motifs. The user can get information on each motif or
polymorphism by selecting the appropriate icon for the polymorphic site. In this example, the result of selecting the
first polymorphic site (as indicated by the red shadow behind the icon) is shown. The text at the top shows the
reference codon and amino acid (CCT, Pro) and the resulting altered codon and amino acid (TCT, Ser). Also given are
the codon frequencies in parentheses. These are calculated by looking at 10,000 codons in a variety of human genes
and calculating how often that particular codon shows up (REF. 3).

[0059] 2. A reference population of healthy individuals with a broad and representative genetic background is defined.
[0060] Analysis of the candidate gene(s) (or other loci) requires an approximate knowledge of what haplotypes exist
for the candidate gene(s) (or other loci) and of their frequencies in the general population. To do this, a reference
population is recruited, or cells from individuals of known ethnic origin are obtained from a public or private source.
The population preferably covers the major ethnogeographic groups in the U.S., European, and Far Eastern
pharmaceutical markets. An algorithm, such as that described below, may be used to choose a minimum number of people in
each population group. For example, if one wants to have a q% chance of not missing a haplotype that occurs in the
population at a frequency of p%, the number of individuals (n) who must be
sampled is given by 2n = log(1-q)/log(1-p), where p and q are expressed as fractions. For instance, if p is 0.05 (i.e., if
one wants to find at least one copy of all haplotypes found at greater than 5% frequency) and q is 0.99 (i.e., one wants
to be sure to the 99% level of confidence of finding the >5% frequency haplotypes), then n = 0.5*log(.01)/log(.95) ~ 45.
There is always a tradeoff between how rare a haplotype one wants to be guaranteed to see and the cost of
experimentally determining haplotypes.
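
The sample-size rule above can be checked numerically. The following is a minimal sketch in Python, assuming only the formula 2n = log(1-q)/log(1-p) stated in the text; the function name `min_reference_individuals` is hypothetical and chosen for illustration.

```python
import math


def min_reference_individuals(p, q):
    """Smallest n giving at least a q chance of seeing a haplotype of frequency p.

    Each individual contributes 2 chromosomes, so the chance of missing the
    haplotype in n individuals is (1 - p) ** (2 * n).
    """
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("p and q must be fractions strictly between 0 and 1")
    chromosomes = math.log(1.0 - q) / math.log(1.0 - p)  # this is 2n in the text
    return math.ceil(chromosomes / 2.0)  # round up to whole individuals


# The worked example from the text: p = 0.05, q = 0.99.
print(min_reference_individuals(0.05, 0.99))
```

Rounding up is a choice made in this sketch so that the stated confidence level is met rather than approximated.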

[0061] 3. For each member of the population, DNA is obtained.

[0062] In the preferred embodiment, for each member of the reference population (called a subject), blood samples
are drawn, and, preferably, immortalized cell lines are produced. The use of immortalized cell lines is preferred because
it is anticipated that individuals will be haplotyped repeatedly, i.e., for each candidate gene (or other loci) in each disease
project. As needed, a cell sample for a member of the population could be taken from the repository and DNA extracted
therefrom. Genomic DNA or cDNA can be extracted using any of the standard methods.

[0063] 4. For each member of the population, the haplotypes for each of the candidate gene(s) (or other loci) are
found.

[0064] The 2 haplotypes for each of the subject's candidate gene(s) (or other loci) are determined. The most preferred
method for haplotyping the reference population is that described in U.S. Application Serial No. 60/198,340 (inventors
Stephens et al.), filed April 18, 2000, which is specifically incorporated by reference herein. Another, less preferred
embodiment for haplotyping the reference population uses the CLASPER System™ technology (Ref. U.S. Patent
Number 5,866,404), which is a technique for direct haplotyping. Other examples of the techniques for direct haplotyping
include single molecule dilution ("SMD") PCR (Ref. 9) and allele-specific PCR (Ref. 10). However, for the purpose of
this invention, any technique for producing the haplotype information may be used.

[0065] The information that is stored in a database, such as a database associated with the DecoGen application
exemplified herein, includes (1) the positions of one or more, preferably two or more, most preferably all, of the sites
in the gene locus (or other loci) that are variable (i.e., polymorphic) across members of the reference population and
(2) the nucleotides found for each individual's 2 haplotypes at each of the polymorphic sites. Preferably, it also includes
individual identifiers and ethnicity or other phenotypic characteristics of each individual.

[0066] In the preferred embodiment of the invention, the haplotypes and their frequencies are stored and displayed,
preferably in the manner shown, e.g., in FIGURES 4A and 4B. Haplotypes and other information about each of the
members of the population being analyzed can be shown, for example, in the manner shown in FIGURE 8. The
information shown in FIGURE 8 includes a unique identifier (PID), ethnicity, age, gender, the 2 haplotypes seen for the
individual, and values of all clinical measurements available for the individual. Quantitative values of clinical measures
would ordinarily be seen by scrolling to the right. However, for the subjects seen in this view, there is no clinical data.
This is because this is the reference population of healthy individuals.

[0067] The haplotype data may also be presented in the context of the entire DNA sequence. Examples of the se- 
quences of the isogenes, with the polymorphisms highlighted, are shown in FIGURE 5. 

[0068] Because an individual has 2 copies of the gene (2 isogenes), and because these 2 copies are often different,
some of the polymorphic sites will show 2 different nucleotides in a genotype, one from each of the isogenes. A genotype
from an individual with haplotypes TAC and GAG would be (T/G),A,(C/G). This is consistent with the haplotypes TAC/
GAG or TAG/GAC. The fact that we do not know which haplotypes gave rise to this genotype leads us to call this an
"unphased genotype". If we haplotype this individual we then determine the "phased genotype", which describes which
particular nucleotides go together in the haplotypes. Phasing is the description of which nucleotide at one polymorphic
site occurs with which nucleotides at other sites. This information is left ambiguous (i.e., unphased) in a genotyping
measurement but is resolved (i.e., phased) in a haplotype measurement.
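
The relationship between phased and unphased genotypes can be made concrete with a short enumeration. The sketch below is illustrative only: the function names are hypothetical, haplotypes are assumed to be equal-length strings over the polymorphic sites, and the input haplotypes TAG and GAT are used purely as an example.

```python
from itertools import product


def unphased_genotype(hap_a, hap_b):
    """Collapse two haplotypes into per-site unordered nucleotide pairs."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same polymorphic sites")
    return [tuple(sorted((a, b))) for a, b in zip(hap_a, hap_b)]


def compatible_haplotype_pairs(genotype):
    """List every haplotype pair that would give rise to this genotype."""
    pairs = set()
    # At each site, pick which of the two nucleotides goes on the first
    # haplotype; the second haplotype receives the other one.
    for picks in product((0, 1), repeat=len(genotype)):
        first = "".join(site[i] for site, i in zip(genotype, picks))
        second = "".join(site[1 - i] for site, i in zip(genotype, picks))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)


genotype = unphased_genotype("TAG", "GAT")
print(genotype)
print(compatible_haplotype_pairs(genotype))
```

An individual heterozygous at k sites yields 2^(k-1) candidate pairs, which is why a genotype with two or more heterozygous sites cannot be phased without additional information.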

[0069] FIGURE 9 is an example of a screen showing the genotype to haplotype resolution for each of the individuals
in the population being examined. At the left of the screen is a shaded (or color) matrix showing the genotype information
at each of the polymorphic sites for each individual (sites across the top, individuals going down the page). The most
and least common nucleotide at each site is defined by looking at both haplotypes of all individuals in the population
at that particular site. The nucleotide that shows up most often is called the most common nucleotide. The one that
shows up less often is termed the least common. In situations where more than 2 nucleotides are seen at a site (which
is rare but not unknown in human genes) all nucleotides except the most common one are lumped together in the least
common category. At the right is a shaded (or color) matrix showing the haplotype resolution. In the genotype view, a
blue square indicates that the individual is homozygous for the most common nucleotide at that site. A yellow square
indicates that the individual is homozygous for the least common base, and a red square indicates that the individual
is heterozygous at the site. On the right hand side, a row for an individual is broken into a top and a bottom half, each
representing one of the two haplotypes. The color scheme is the same as on the left except that all of the heterozygous
sites have been resolved. The + and - buttons are for zooming in and out.
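
The coloring rule for the genotype matrix can be sketched as follows. This is an assumption-laden illustration, not the DecoGen implementation: the function name and data layout (one pair of equal-length haplotype strings per individual) are hypothetical, and ties for the most common nucleotide are broken by first appearance.

```python
from collections import Counter


def genotype_colors(haplotype_pairs):
    """Return one row of 'blue'/'yellow'/'red' labels per individual."""
    n_sites = len(haplotype_pairs[0][0])
    # Most common nucleotide per site, counted over both haplotypes of everyone.
    most_common = []
    for s in range(n_sites):
        counts = Counter()
        for hap1, hap2 in haplotype_pairs:
            counts[hap1[s]] += 1
            counts[hap2[s]] += 1
        most_common.append(counts.most_common(1)[0][0])
    rows = []
    for hap1, hap2 in haplotype_pairs:
        row = []
        for s in range(n_sites):
            # All other nucleotides are lumped into the "least common" class.
            n_major = (hap1[s] == most_common[s]) + (hap2[s] == most_common[s])
            row.append({2: "blue", 1: "red", 0: "yellow"}[n_major])
        rows.append(row)
    return rows


people = [("TAG", "TAG"), ("TAG", "GAT"), ("GAT", "GAT"), ("TAG", "TAG")]
for row in genotype_colors(people):
    print(row)
```

Because of the lumping rule, an individual carrying two different rare nucleotides at a site is labeled the same way as one who is homozygous for a single rare nucleotide.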

[0070] Unrelated individuals who are heterozygous at more than 1 site cannot be haplotyped without (1) using a direct
molecular haplotyping method such as CLASPER System™ technology or (2) making use of knowledge of haplotype
frequencies in the population, as described below or, preferably, as described in U.S. Application Serial No. 60/198,340
(inventors Stephens et al.), filed April 18, 2000.

[0071] 5. Population averages and statistics for each of the haplotypes in the reference population are determined.
[0072] Once the individual haplotypes of the reference population have been determined, the population statistics
may be calculated and displayed in a manner exemplified herein in FIGURE 10. FIGURE 10 is an example of one of
several screens showing information about the pair of haplotypes for the candidate gene(s) (or other loci) found in an
individual. In this screen, each cell of the matrix displays some information about the group of people who were found
to have the haplotypes corresponding to the particular row and column. In all of these screens, subjects can be grouped
together by pairs of haplotypes or sub-haplotypes, where a sub-haplotype is made up of a subset of the total group of
polymorphic sites. For example, at the top of the screen in the figure are checkboxes allowing the user to select the
subset of polymorphic sites to be examined (here sites 2 and 8 are chosen). The + and - buttons are for zooming in
and out, which increases and decreases the viewing size of the matrix. The "Recalculate" button causes the statistics
for the groups to be recalculated after a new subset of polymorphic sites has been selected. At the bottom is the matrix.
The selected cell (outlined in green in this figure) displays information about subjects who are homozygous for C and
G at sites 2 and 8. The text to the right gives summary numerical information about the subjects in that box. In particular,
this screen shows the distribution of subjects in the different ethnogeographic groups with each of the haplotype pairs.
In this example, 23 subjects (18 Caucasians and 5 Asians) were found to be homozygous for C and G at sites 2 and
8. In this example, the heights of the bars are normalized individually for each cell so that it is not possible in this
example to see relative numbers of individuals cell to cell by looking at the heights. An alternative normalization (in
which there is a consistent normalization for all boxes) is also possible. More detailed information is available by
selecting the "View Details" button at the top (see FIGURE 11).

[0073] FIGURE 11 is a more detailed view of the information that is available from the summary view shown in
FIGURE 10. At the bottom, one row is shown for each haplotype pair found in the population being analyzed. Each
row shows the corresponding 2 sub-haplotypes, the total number of individuals found with that sub-haplotype and the
fraction of the total population represented by this number. Next to these are 3 columns for each ethnogeographic
group. The first gives the number of individuals in that ethnogeographic group with that haplotype pair. The second
gives the fraction of individuals (found in a database of the present invention) in that world population group who have
that haplotype pair. The third column gives the expected number based on Hardy-Weinberg equilibrium.
[0074] The observed haplotype pair frequencies in the population, in particular the reference population, are
preferably corrected for finite-size samples. This is preferably done when the data is being used for predictive genotyping.
If it is assumed that each of the major population groups will be in Hardy-Weinberg equilibrium, this allows one to
estimate the underlying frequencies for haplotype pairs in the reference population that are not directly observed. It is
necessary to have good estimates of the haplotype-pair frequencies in the reference population in order to predict
subjects' haplotypes from indirect measurements that will be used in a diagnostic context (see item 6). Preferably the
reference population has been chosen to be representative of the population as a whole so that any haplotypes seen
in a clinical population have already been seen in the reference population. Furthermore, it would be possible to
determine whether certain haplotypes are enriched in the patient population relative to the reference population. This
would indicate that those haplotypes are causative of or correlated with the disease state.

[0075] Hardy-Weinberg equilibrium (Ref. 1, Chapter 3) postulates that the frequency of finding the haplotype pair
$H_1/H_2$ is equal to $P_{H-W}(H_1/H_2) = 2\,p(H_1)\,p(H_2)$ if $H_1 \neq H_2$ and $P_{H-W}(H_1/H_2) = p(H_1)\,p(H_2)$ if
$H_1 = H_2$. Here, $p(H_i)$ (where $i$ = 1 or 2) is the probability of finding the haplotype $H_i$ in the population,
regardless of whatever other haplotype it occurs
with. Hardy-Weinberg equilibrium usually holds in a distinct ethnogeographic group unless there is significant
inbreeding or there is a strong selective pressure on a gene. Actual observed population frequencies $P_{obs}(H_1/H_2)$ and the
corresponding Hardy-Weinberg predicted frequencies $P_{H-W}(H_1/H_2)$ are shown in FIGURE 11, discussed above.
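
The comparison of observed and Hardy-Weinberg predicted pair frequencies can be sketched in a few lines. The example below is a hedged illustration: the function name and the input layout (a list of haplotype pairs, one per individual) are assumptions, and haplotype frequencies are estimated simply by counting chromosomes without any finite-sample correction.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hardy_weinberg_table(observed_pairs):
    """Map each haplotype pair to (observed frequency, H-W predicted frequency)."""
    n = len(observed_pairs)
    # p(H): fraction of the 2n chromosomes carrying haplotype H.
    hap_counts = Counter(h for pair in observed_pairs for h in pair)
    p = {h: c / (2 * n) for h, c in hap_counts.items()}
    pair_counts = Counter(tuple(sorted(pair)) for pair in observed_pairs)
    table = {}
    for h1, h2 in combinations_with_replacement(sorted(p), 2):
        # 2 p(H1) p(H2) for different haplotypes, p(H1) p(H2) for identical ones.
        expected = p[h1] * p[h2] * (1 if h1 == h2 else 2)
        table[(h1, h2)] = (pair_counts[(h1, h2)] / n, expected)
    return table


subjects = [("TAG", "TAG"), ("TAG", "GAT"), ("GAT", "TAG"), ("GAT", "GAT")]
for pair, (obs, exp) in hardy_weinberg_table(subjects).items():
    print(pair, round(obs, 3), round(exp, 3))
```

Pairs that were never observed still receive a predicted frequency, which is the property the text relies on when estimating frequencies for unobserved haplotype pairs.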
[0076] If large deviations from Hardy-Weinberg equilibrium are observed in the reference population, the number of
individuals can be increased to see if this is a sampling bias. If it is not, then it may be assumed that the haplotype is
either historically recent or is under selection pressure. A statistical test may be used, e.g., test is



If so, the variation is large.

[0077] 6. (Optional - this step can be skipped if direct molecular haplotyping will be used on all clinical samples.) An
optimal set of genotyping markers is determined. These markers often allow an individual's haplotypes to be accurately
predicted without using full haplotype analysis. This genotyping method relies on the haplotype distribution found
directly from the reference population.

[0078] One of several methods to test subjects for the existence of a given pair of haplotypes in an individual can
be used. These methods can include finding surrogate physical exam measurements that are found to correlate with
haplotype pair; serum measurements (e.g., protein tests, antibody tests, and small molecule tests) that correlate with
haplotype pair; or DNA-based tests that correlate with haplotype pair. An example that is used herein is to predict
haplotype pair based on an (unphased) genotype at one or more of the polymorphic sites using an algorithm such as
the one described further below.

[0079] For example, as discussed above, in the case where the two haplotypes are TAG and GAT, the genotyping
information would only provide the information that the subject is heterozygous T/G at site 1, homozygous A at site 2
and heterozygous G/T at site 3. This genotype is consistent with the following haplotype pairs: TAG/GAT (the correct
one) and GAG/TAT (the incorrect one). Assuming that the underlying probability (as measured in the reference
population) for TAG/GAT is p% and for GAG/TAT is q%, subjects may be randomly assigned to the first group with a probability
p/(p+q) and to the second group with a probability q/(p+q). If p >> q, then subjects will almost always be assigned
to the correct haplotype pair group if they are TAG/GAT, but the GAG/TAT individuals will almost always be misclassified.
However, the majority of individuals will be assigned to the correct haplotype-pair group. In the case that q=0, the
correct assignment will always be made. For cases where p~q, this classification gives very low accuracy predictions,
so other methods to resolve the subjects' haplotypes must be resorted to. One can always directly find the correct
haplotypes using CLASPER System™ technology or other direct molecular haplotyping method.
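
The random-assignment rule above generalizes directly to any number of haplotype pairs that share a genotype. The sketch below assumes a hypothetical dictionary of reference-population frequencies keyed by pair label; the expected-accuracy figure is a derived quantity that follows from the assignment rule and is not stated in the text.

```python
def assignment_probabilities(reference_freqs):
    """Probability of assigning a subject to each compatible haplotype pair."""
    total = sum(reference_freqs.values())
    if total <= 0:
        raise ValueError("at least one compatible pair must have nonzero frequency")
    return {pair: freq / total for pair, freq in reference_freqs.items()}


def expected_accuracy(reference_freqs):
    """Chance that a random subject with this genotype is assigned correctly.

    A subject truly carries pair k with probability w_k and is assigned to
    pair k with the same probability w_k, so the accuracy is the sum of w_k**2.
    """
    weights = assignment_probabilities(reference_freqs)
    return sum(w * w for w in weights.values())


# Hypothetical reference frequencies (in percent) for the two compatible pairs.
freqs = {"TAG/GAT": 9.0, "GAG/TAT": 1.0}
print(assignment_probabilities(freqs))
print(expected_accuracy(freqs))
```

With p = q the expected accuracy falls to one half, which matches the observation that the classification is of little use when the competing pairs are about equally frequent.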

[0080] The ability to use genotypes to predict haplotypes is based on the concept of linkage. Two sites in a gene are
linked if the nucleotide found at the first site tends to be correlated with the nucleotide found at the second site. Linkage
calculations start with the linkage matrix, which gives the probabilities of finding the different combinations of nucleotides
at the two sites. For instance, the following matrix connects 2 sites, one of which can have nucleotide A or T and the
other of which can have nucleotide G or C. The fraction of individuals in the population with A at site 1 and G at site 2
is 0.15.



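[0081] In general, the matrix is given by

                       Site 2 - Allele 1    Site 2 - Allele 2    Row sum
Site 1 - Allele 1      $p_{11}$             $p_{12}$             $p_{1+}$
Site 1 - Allele 2      $p_{21}$             $p_{22}$             $p_{2+}$
Column sum             $p_{+1}$             $p_{+2}$             1

[0082] The values $p_{1+}$ and $p_{2+}$ give the sum of the respective rows while the values $p_{+1}$ and $p_{+2}$ give the sum over
the respective columns. By definition, $p_{1+} + p_{2+} = p_{+1} + p_{+2} = 1$. Three standard measures of linkage disequilibrium that
are used are: (Ref. 1, Chapter 3)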
[0083] FIGURE 12 is an example of a screen showing a measure of the linkage between different polymorphic sites
in the gene. Measures of linkage tell how well we can predict the nucleotide at one polymorphic site given the nucleotide
at another site. A high value of the linkage measure indicates a high level of predictive ability. This screen shows D'.
The color of the square in the display at the intersection of sites α and β indicates the value of the linkage measure.
Red indicates strong linkage and blue indicates weak to non-existent linkage. White squares in a row indicate that the
corresponding polymorphic site has no variation in the population being examined. Such sites are included because
there is information about the presence of polymorphisms other than that provided by our haplotype analysis. This
would be the case if a polymorphism was reported in the literature which we were not able to detect in our population.
The values to the right of the matrix give $I_{hap}$ for each of the sites. $I_{hap}$ is a measure of the information content of the
single site and is given by



where $N_{hap}$ is the number of distinct haplotypes observed; P(j) is the probability of finding haplotype j and P(j|i)
is the conditional probability of finding haplotype j with nucleotide i. (The conditional probability P(j|i) is the probability
of finding haplotype j in the subset of all observations where nucleotide i is seen.) High values of $I_{hap}$ (~2.0) indicate
that at least some pairs of observed haplotypes can be distinguished by looking at that single site. Small values (1.0)
indicate that the particular site is not informative for distinguishing any pair of haplotypes. This same method can be
used for sub-haplotypes. These values are useful for choosing sites for genotyping, as described above. The + and -
boxes are for zooming in and out.
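
A linkage value such as the D' shown in the linkage screen can be computed from a 2x2 linkage matrix of the kind described earlier. The sketch below assumes the common textbook definitions of D, D' and r-squared; these may not be exactly the three measures the specification has in mind, and the function name and matrix layout (rows for site 1 alleles, columns for site 2 alleles) are hypothetical.

```python
def linkage_measures(matrix):
    """Return (D, D', r_squared) for a 2x2 matrix of two-site frequencies."""
    (p11, p12), (p21, p22) = matrix
    row1, row2 = p11 + p12, p21 + p22  # marginal frequencies at site 1
    col1, col2 = p11 + p21, p12 + p22  # marginal frequencies at site 2
    d = p11 * p22 - p12 * p21
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        # One of the sites shows no variation, so linkage is undefined;
        # this sketch reports zero for the normalized measures.
        return d, 0.0, 0.0
    # D' rescales D by the largest magnitude it could take given the marginals.
    if d >= 0:
        d_max = min(row1 * col2, row2 * col1)
    else:
        d_max = min(row1 * col1, row2 * col2)
    return d, d / d_max, d * d / denom


# Hypothetical matrix: only the 0.15 entry is taken from the text's example.
print(linkage_measures([[0.15, 0.35], [0.40, 0.10]]))
```

The monomorphic case corresponds to the white squares in the display, where no linkage value can be reported.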

[0084] FIGURES 13, 14, and 15 show views of a tool for performing an analysis of which polymorphic sites may be
genotyped in order to determine an individual's haplotypes by the method of predictive haplotyping, rather than using
more expensive direct haplotyping methods, such as the CLASPER System™ method of haplotyping. In these screens,
one chooses a subset of polymorphic sites of interest (the entire haplotype or a sub-haplotype can be examined) and
then a subset of sites at which the subject is to be genotyped. The colors in the haplotype-pair boxes then indicate the
fraction of individuals in that box who are correctly haplotyped based on the statistical model described in the previous
paragraph. FIGURE 14 gives the predicted values and FIGURE 15 shows a tool for directly finding the optimal set of
genotyping sites.

[0085] The purpose of the three screens in FIGURES 13, 14 and 15 is to provide an example of the tools to find the
simplest genotyping experiment that could detect an individual's haplotypes. The basic layout of the screen in FIGURE
13 is the same as described in FIGURE 10. The top row of checkboxes is used to select the haplotype or sub-haplotype which
is desired to be determined. There is one other row of checkboxes beneath those for choosing the haplotype or
sub-haplotype. This second row, labeled "Genotype Loci", allows the user to select a subset of positions at which to
genotype. The color of the square in the matrix indicates the fraction of individuals who are actually in that category who
would be correctly categorized using this sub-genotype. For example, this screen shows that individuals homozygous
for TGG at positions 2, 3, and 8 would be correctly haplotyped by genotyping at positions 2 and 8. Selection of optimal
genotyping sites is aided by information from the Linkage View (FIGURE 12). Typically one will only need to genotype
one site of a pair of polymorphic sites that are in strong linkage.

[0086] The screen in FIGURE 14 gives a numerical view of the data shown in FIGURE 13. One can see that if one
genotypes at sites 2 and 8, one could assign individuals to the TGG/TGG group with 100% confidence (based on the
data obtained for the reference population). However, one would have low confidence in the ability to assign individuals
to the CAG/CGG group.

[0087] FIGURE 15 is an example of a screen showing the results of a tool for directly finding the optimal genotyping
sites. This screen gives the results of a simple optimization approach to finding the simplest genotyping approach for
predicting an individual's haplotypes. For each haplotype pair, the predictive abilities of all single-site genotyping
experiments are calculated. If any of these has a predictive ability of greater than some cutoff (say 90%), then that
single-site genotype test is shown. A single-site genotype test is one in which an individual's nucleotide(s) is found at that
single site. This can be done using any of several standard methods including DNA sequencing, single-base extension,
allele-specific PCR, or TOF-mass spec. (In the figure, a red box indicates that individuals should be genotyped at that
site, and a white box indicates that the individual should not be genotyped there.) If no single-site test has a predictive
ability of greater than the cutoff, then the calculated predictive abilities of all 2-site genotyping tests are examined by the
computer program. The first 2-site test whose predictive ability exceeds the cutoff is then displayed. If no 2-site test is
successful, then the predictive abilities of all 3-site tests are examined by the computer program, and so on. The mask
at the right hand side of this display shows the first test found that exceeded the cutoff value.
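
The search described above (all single-site tests first, then all 2-site tests, and so on) can be sketched as follows. This is an illustrative reading rather than the program behind FIGURE 15: the function names and data layout are hypothetical, sites are 0-indexed, and predictive ability is assumed to be the target pair's share of the reference frequency among all pairs that give the same genotype at the tested sites.

```python
from itertools import combinations


def sub_genotype(pair, sites):
    """Unphased genotype of a haplotype pair restricted to the given sites."""
    return tuple(tuple(sorted((pair[0][s], pair[1][s]))) for s in sites)


def predictive_ability(target, pair_freqs, sites):
    """Fraction of subjects with the target's sub-genotype who carry the target pair."""
    key = sub_genotype(target, sites)
    same = sum(f for pair, f in pair_freqs.items() if sub_genotype(pair, sites) == key)
    return pair_freqs[target] / same


def first_sufficient_test(target, pair_freqs, cutoff=0.9):
    """Smallest site set (first found) whose predictive ability exceeds the cutoff."""
    n_sites = len(target[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            if predictive_ability(target, pair_freqs, sites) > cutoff:
                return sites
    return None  # no genotyping test resolves this pair well enough


# Hypothetical reference-population frequencies for four haplotype pairs.
reference = {
    ("TAG", "GAT"): 0.3,
    ("TAG", "GAG"): 0.3,
    ("TAG", "TAT"): 0.2,
    ("TAG", "TAG"): 0.2,
}
print(first_sufficient_test(("TAG", "GAT"), reference))
```

A return value of `None` corresponds to the situation in which direct molecular haplotyping would be needed, for example when two pairs give identical genotypes at every site and have comparable frequencies.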

[0088] An improved method for finding optimal genotyping sites is described in section D, below.

[0089] FIGURES 16 and 17 are examples of screens demonstrating another tool for analyzing linkage. This tool is
a minimal spanning network which shows the relatedness of the haplotypes seen in the population (Ref. 8). Haplotypes
are amenable to modes of analysis that are not available for isolated variants (e.g., SNPs). In particular, a sample of
haplotypes reflects the actual phylogenetic history of the genetic locus. This history includes the divergence patterns
among the haplotypes, the order of mutational and recombinational events, and a better understanding of the actual
variation among the different populations comprising the sample. These considerations are important in the assessment
of a locus's involvement in a particular phenotype (e.g., differential response to a drug or adverse side effects). The
phylogenetic algorithms included in the DecoGen™ application are both exploratory and analytical tools, in that they
allow consideration of partial haplotypes as well as those based on the full set of haplotypes in the context of clinical
data. The checkboxes and recalculate button shown in FIGURES 16 and 17 serve the purpose of selecting
sub-haplotypes as described under FIGURE 10. The results of the calculations are shown in real time, i.e., the sizes and
positions of the balls, as well as the length of the lines, change as the calculation progresses. Here a circle represents
a haplotype. The distance between haplotypes is a rough measure of the number of nucleotides that would have to
be flipped to change one haplotype into the other. Pairs of haplotypes separated by one nucleotide flip are connected
with black lines. Pairs separated by 2 flips are connected with light blue lines. The size of the haplotype ball increases
with the frequency of that haplotype in the population. Each haplotype or sub-haplotype ball is labeled with the relevant
nucleotide string. The user can toggle the labels off and on by selecting the haplotype ball, e.g., with a mouse. The +
and - boxes are for zooming in and out. The "View Hap Pairs" box serves the purpose of showing the pairing information
for haplotypes. The lines shown in this figure are replaced with lines connecting pairs of haplotypes seen in each
individual. The colors in the balls, and the pie shaped pieces, represent the fraction of that haplotype found in each of the major
ethnogeographic groups. Red represents Caucasian, blue African-American, light blue Asian, green Hispanic/Latino.
The Minimum Size checkbox allows the user to select sub-haplotypes as in earlier Figures (see FIGURE 10).
[0090] This aspect of the invention relates to a graphical display of the haplotypes (including sub-haplotypes) of a
gene grouped according to their evolutionary relatedness. As used herein, "evolutionary relatedness" of two haplotypes
is measured by how many nucleotides have to be flipped in one of the haplotypes to produce the other haplotype.
[0091] In one embodiment, the display is a minimal spanning network in which a haplotype is represented by a
symbol such as a circle, square, triangle, star and the like. Symbols representing different haplotypes of a gene may
be visually distinguished from each other by being labeled with the haplotype and/or may have different colors, different
shading tones, cross-hatch patterns and the like. Any two haplotype symbols are separated from each other by a
distance, referred to as the ideal distance, that is proportional to the evolutionary relatedness between their represented
haplotypes. For example, if displaying a group of haplotypes related by one, two or three nucleotide flips, the
proportional distances between the haplotype symbols could be one inch, two inches, and three inches, respectively. The
haplotype symbols may be connected by lines, which may have different appearances, i.e., different colors, solid vs.
dotted vs. dashed, and the like, to help visually distinguish between one nucleotide flip, two nucleotide flips, three
nucleotide flips, etc.
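
The flip count that defines evolutionary relatedness, and the lines drawn between haplotype symbols, can be sketched briefly. The function names are hypothetical, haplotypes are assumed to be equal-length nucleotide strings, and only pairs within a chosen number of flips are connected, mirroring the one-flip and two-flip lines described for the network screens.

```python
from itertools import combinations


def flips(hap_a, hap_b):
    """Number of sites at which the two haplotypes carry different nucleotides."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same polymorphic sites")
    return sum(a != b for a, b in zip(hap_a, hap_b))


def network_edges(haplotypes, max_flips=2):
    """List (haplotype, haplotype, flips) for every pair within max_flips."""
    edges = []
    for a, b in combinations(sorted(set(haplotypes)), 2):
        distance = flips(a, b)
        if distance <= max_flips:
            edges.append((a, b, distance))
    return edges


for edge in network_edges(["TAG", "GAG", "GAT", "TAT"]):
    print(edge)
```

The third element of each edge can be used to choose the line style, and multiplying it by a fixed unit length gives the ideal distance between the two symbols.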

[0092] In a preferred embodiment, the method is implemented by a computer and the graphical display is produced
by an algorithm that connects haplotype symbols by springs whose equilibrium distance is proportional to the ideal
distance. Preferably, the size of a particular haplotype symbol is proportional to the frequency of that haplotype in the
population. In addition, the haplotype symbol may be divided into regions representing different characteristics possessed
by members of the population, such as ethnicity, sex, age, or differences in a phenotype such as height, weight,
drug response, disease susceptibility and the like. The different regions in a haplotype symbol may be represented by
different colors, shading tones, stippling, etc. In a particularly preferred embodiment, generation of the graphical display
is shown in real time, i.e., the positions and sizes of haplotype symbols, as well as the lengths of their connecting
springs, change as the algorithm-directed organization of the haplotypes of a particular gene proceeds.
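The spring-based organization can be sketched as a simple iterative relaxation. This is only an illustration under stated assumptions: the function name `relax`, the step count, and the relaxation rate are hypothetical, the layout is two-dimensional, and no drawing or real-time display is attempted.

```python
import math
import random

def relax(ideal, steps=2000, rate=0.05, seed=0):
    """Place symbols so connected pairs approach their ideal distance.

    ideal maps (i, j) index pairs to the rest length of the spring joining
    haplotype symbols i and j. Returns a list of (x, y) positions.
    """
    rng = random.Random(seed)
    n = 1 + max(max(pair) for pair in ideal)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        for (i, j), rest in ideal.items():
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            dist = math.hypot(dx, dy) or 1e-9
            # Hooke-style update: both ends move along the spring so that
            # its length shifts toward the rest length.
            shift = rate * (dist - rest) / dist
            pos[i][0] += shift * dx
            pos[i][1] += shift * dy
            pos[j][0] -= shift * dx
            pos[j][1] -= shift * dy
    return [tuple(p) for p in pos]

# Three hypothetical haplotypes: 0 and 1 differ by one flip, 1 and 2 by two.
layout = relax({(0, 1): 1.0, (1, 2): 2.0})
```

Redrawing the symbols after each pass of the outer loop would give the kind of progressively settling display described above.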
[0093] The resulting display provides a visual impression of the phylogenetic history of the locus, including the divergence
patterns among the haplotypes for that locus, as well as providing a better understanding of the actual variation
among the different populations comprising the sample. These considerations are important in the assessment
of the encoded protein's involvement in a particular phenotype (e.g., differential response to a drug or adverse side
effects). In addition, a spanning network generated for haplotypes in a clinical population using the same algorithm
may be superimposed on the spanning network for the reference population to analyze whether the haplotype content
of the clinical population is representative of the reference population.

[0094] 7. A trial population of individuals who suffer from the condition of interest is recruited. 
[0095] The end result of the CTS method is the correlation of an underlying genetic makeup (in the form of haplotype
or sub-haplotype pairs for one or more genes or other loci) and a treatment outcome. In order to deduce this correlation
it is necessary to run a clinical trial or to analyze the results of a clinical trial that has already been run. Individuals who
suffer from the condition of interest are recruited. Standard methods may be used to define the patient population and
to enroll subjects.

[0096] Individuals in the trial population are optionally graded for the existence of the underlying cause (disease/
condition) of interest. This step will be important in cases where the symptom being presented by the patients can
arise from more than one underlying cause, and where the treatments of the underlying causes are not the same. An
example of this would be where patients experience breathing difficulties that are due to either asthma or respiratory
infections. If both sets were included in a trial of an asthma medication, there would be a spurious group of apparent
non-responders who did not actually have asthma. These people would degrade any correlation between haplotype
and treatment outcome.

[0097] This grading of potential patients could employ a standard physical exam or one or more lab tests. It could
also use haplotyping for situations where there was a strong correlation between haplotype pair and disease susceptibility
or severity.

[0098] 8. Individuals in the trial population are treated using some protocol and their response is measured. In addition,
they are haplotyped, either directly or using predictive genotyping.

[0099] This step is straightforward. If patients are to be haplotyped for the candidate genes, a direct molecular haplotyping
method could be used. If they are to be indirectly haplotyped, a method such as the one described above in
Item 6 could be used. Clinical outcomes in response to the treatment are measured using standard protocols set up
for the clinical trial.

[0100] 9. Correlations between individual response and haplotype content are created for the candidate genes. From 
these correlations, a mathematical model is constructed that predicts response as a function of haplotype content. 
[0101] Correlations may be produced in several ways. In one method, averages and standard deviations for the
haplotype-pair groups may be calculated. This can also be done for sub-haplotype-pair groups. These can be displayed
in a color-coded manner with low responding groups being colored one way and high responding groups colored
another way (see, e.g., FIGURE 18). Distributions in the form of bar graphs can also be displayed (see, e.g., FIGURE
19), as can all group means and standard deviations (see, e.g., FIGURE 20).
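The per-group averages and standard deviations can be computed along the following lines. This is a hedged sketch: the record layout (haplotype number, haplotype number, response), the function name `group_stats`, and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, stdev

def group_stats(records):
    """Summarize responses per haplotype-pair group.

    records: iterable of (hap1, hap2, response) tuples, one per individual.
    Returns {(hap_low, hap_high): (count, mean, standard deviation)}.
    """
    groups = defaultdict(list)
    for hap1, hap2, response in records:
        # The pair (1, 2) and the pair (2, 1) describe the same group.
        groups[tuple(sorted((hap1, hap2)))].append(response)
    return {
        pair: (len(vals), mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for pair, vals in groups.items()
    }

# Hypothetical responses for three individuals.
stats = group_stats([(1, 2, 0.2), (2, 1, 0.4), (3, 3, 0.5)])
```

The same function applies to sub-haplotype pairs if the haplotype identifiers are replaced by sub-haplotype identifiers.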

[0102] The information in FIGURES 18-24 may be used to determine whether haplotype information for the gene
being examined can be used to predict clinical response to the treatment. One question that can be answered is whether
there is a significant difference in response between groups of individuals with different haplotype pairs. FIGURES
18-22 show screens of the data that connect haplotypes with clinical outcomes. The example shown in FIGURE 18
and the next several screens gives the results of a simulated clinical trial run to test the link between patients' haplotypes
for CYP2D6 and a phenotypic response called "Test". The main layout of this page is the same as described in FIGURE
10. At the left side of this view is a list of the clinical measurements performed on the patients. This list is completely
generic as far as the invention is concerned. Selecting the relevant radio button will bring up data for any of the clinical
measurements. (Only one "Test" radio button is shown here, but there may be many, corresponding to different tests,
with appropriate labels.) In this view, the color in a cell of the matrix indicates the mean value of the measurement for
the individuals in that haplotype-pair group. When one of the cells is selected, text appears at the right, giving the 2
haplotypes, the number of patients in the cell, and the mean value and standard deviation for individuals in the cell. A slide
bar is present below the color boxes near the top of the screen indicating 0% to 100% so that moving, e.g., one or both
of the ends of the bar will change the color scale in the color boxes at the top of the screen as well as the colors in the
matrix. (Note that a slide bar may be used with any screen with similar colored (or otherwise graded) boxes.) FIGURE
19 is a screen showing the distribution of the patients in each cell of the clinical measurement matrix of FIGURE 18.
In this case, the histograms are collectively normalized so that the user can directly compare frequencies from one
cell to the next. The screen in FIGURE 20 is brought up when the user selects any of the cells in the haplotype-pair
matrix in FIGURE 19. This shows the number of patients in the various response bins indicated on the horizontal axis.
A response bin simply counts the number of individuals whose response is within a particular interval. For instance,
there are 7 individuals in the response bin from 0.2 to 0.25 in FIGURE 20.
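The response bins can be sketched as follows, assuming equal-width intervals of 0.05 as in the example bin from 0.2 to 0.25. The function name `response_bins` and the sample responses are hypothetical.

```python
import math
from collections import Counter

def response_bins(responses, width=0.05):
    """Map bin index k to the number of responses in [k*width, (k+1)*width)."""
    # Values lying exactly on a bin edge are subject to floating-point
    # rounding; a production version would need an explicit edge policy.
    return Counter(math.floor(r / width) for r in responses)

# Hypothetical responses: two fall in [0.20, 0.25), one in [0.25, 0.30).
bins = response_bins([0.21, 0.24, 0.26])
```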

[0103] The result of the regression calculation shown in FIGURE 21 (which calculation is described below) allows the
user to see which polymorphic sites give the most significant contribution to the differences in phenotype. This display
comes up in a separate window when the user pushes the "Regression" button on the "Clinical Measurements vs.
Haplotype View" (FIGURES 18, 19, or 21). Shown are the results of a dose-response linear regression calculation on
each of the individual polymorphisms (Ref. 4, Chapter 9). In this case, sites 2 and 8 are most predictive, as indicated
by their large values of the significance level. This fact would lead the user to examine the site 2/8 sub-haplotypes as
in FIGURE 22. This screen gives a detailed view of the mean and standard deviation values for each of the cells in
FIGURE 18. Also shown are the Chi-squared values for the distributions. These values indicate how close the distributions
in each haplotype-pair group are to normal. The function Q(chi-squared) gives a level of statistical significance.
If Q>0.05 the user could not reject the hypothesis that the distribution is normal. FIGURE 22 shows that groups having
different 2/8 sub-haplotypes can have very different mean values of the Test phenotype. To see if this group-to-group
variation is significant, the user could ask the DecoGen™ application to perform an ANOVA (Analysis of Variance)
calculation. The results of an ANOVA calculation are shown in FIGURE 23. Selecting the ANOVA button on any of the
earlier Clinical Measurements views brings up this display. This view uses standard calculation methods to see if the
variation in clinical response between haplotype-pair groups is statistically significant. The methods used are described
in Ref. 4, Chapter 10. FIGURE 23 shows that the variation between different 2/8 subhaplotype groups is statistically
significant at the 99% confidence level.

[0104] The regression model used in FIGURE 21 starts with a model of the form

$$r = r_0 + S \times d \qquad (5)$$

where r is the response, r_0 is a constant called the "intercept", S is the slope and d is the dose. As discussed
previously, the most-common nucleotide at the site and the least-common nucleotide are defined. For each individual
in the population, we calculate his "dose" as the number of least-common nucleotides he has at the site of interest.
This value can be 0 (homozygous for the most-common nucleotide), 1 (heterozygous), or 2 (homozygous for the
least-common nucleotide). An individual's "response" is the value of the clinical measurement. Standard linear regression
methods are then used to fit all of the individuals' dose and response to a single model. The outputs of the regression
calculation are the intercept r_0, the slope S, and the variance (which measures how well the data fits this simple linear
model). The Student's t-test value and the level of significance can then be calculated. This figure shows the relevant
variables (site, slope S, intercept r_0, variance, Student's t-test value and level of significance) for each of the sites.
[0105] From the results shown in FIGURE 21, the user would see that the nucleotides at sites 2 and 8 have significant
contributions to the Test variable. This result would be interpreted as follows. Averaging over all variables other than
the nucleotides at site 2, the Test variable can be predicted by

Test = 0.231 + 0.154 × (number of Ts at site 2).
[0106] On average, an individual homozygous for C at site 2 will have a response of 0.231. Heterozygous individuals
have an average response of 0.385, and individuals homozygous for T have an average response of 0.539. This trend
is significant at the 99.9% confidence level. It is important to note that the calculation of significance (the Student's t-test)
is based on the assumption that the responses for individuals (such as seen in FIGURE 20) are
normally distributed. The present invention can incorporate any of the standard methods for calculating statistical significance
for non-normal distributions. Furthermore, the present invention can include more complex dose-response
calculations that examine multiple sites simultaneously. See, e.g., Ref. 4.
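The single-site dose-response fit can be sketched with ordinary least squares. The function name `dose_response_fit` is hypothetical, and the six data points are synthetic values chosen to scatter symmetrically around the line quoted above (intercept 0.231, slope 0.154); they are not trial data. The level of significance is omitted because it requires the Student's t distribution, which would come from a table or a statistics library.

```python
from statistics import mean

def dose_response_fit(doses, responses):
    """Least-squares fit of response = r0 + S * dose.

    Returns (r0, S, residual variance, t value of the slope).
    """
    n = len(doses)
    d_bar, r_bar = mean(doses), mean(responses)
    sxx = sum((d - d_bar) ** 2 for d in doses)
    sxy = sum((d - d_bar) * (r - r_bar) for d, r in zip(doses, responses))
    slope = sxy / sxx
    intercept = r_bar - slope * d_bar
    rss = sum((r - (intercept + slope * d)) ** 2
              for d, r in zip(doses, responses))
    variance = rss / (n - 2)  # two fitted parameters
    t_value = slope / (variance / sxx) ** 0.5 if variance > 0 else float("inf")
    return intercept, slope, variance, t_value

# Dose = number of Ts at the site (0, 1 or 2); responses are synthetic.
doses = [0, 0, 1, 1, 2, 2]
responses = [0.221, 0.241, 0.375, 0.395, 0.529, 0.549]
r0, S, variance, t_value = dose_response_fit(doses, responses)
predicted = [r0 + S * d for d in (0, 1, 2)]
```

With these inputs the fitted line reproduces the three average responses given in the text for 0, 1 and 2 copies of T.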

[0107] A second method for finding correlations uses predictive models based on error-minimizing optimization algorithms.
One of many possible optimization algorithms is a genetic algorithm (Ref. 5). Simulated annealing (Ref. 6,
Chapter 10), neural networks (Ref. 7, Chapter 18), standard gradient descent methods (Ref. 6, Chapter 10), or other
global or local optimization approaches (see discussion in Ref. 5) could also be used. As an example (one that is
currently implemented in the DecoGen™ application) a genetic algorithm approach is described herein. This method
searches for optimal parameters or weights in linear or non-linear models connecting haplotype loci and clinical outcome.
One model is of the form

$$C = C_0 + \sum_{\alpha} \sum_{i} \left( w_{i\alpha} R_{i\alpha} + w'_{i\alpha} L_{i\alpha} \right) \qquad (6)$$

where C is the measured clinical outcome, i goes over all polymorphic sites, α over all candidate genes, C_0, w_{iα}
and w'_{iα} are variable weight values, R_{iα} is equal to 1 if site i in gene α in the first haplotype takes on the most common
nucleotide and -1 if it takes on the less common nucleotide. L_{iα} is the same as R_{iα} except for the second haplotype.
The constant term C_0 and the weights w_{iα} and w'_{iα} are varied by the genetic algorithm during a search process that
minimizes the error between the measured value of C and the value calculated from Equation 6. Models other than
the one given in Equation 6 can be easily incorporated. The genetic algorithm is especially suited for searching not
only over the space of weights in a particular model but also over the space of possible models (Ref. 5).
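A compact sketch of such a search is given below. It assumes a single gene and a linear model of the form C = C_0 + Σ_i (w_i R_i + w'_i L_i), which is consistent with the definitions of C, C_0, w, w', R and L above. The population size, elitist selection, one-point crossover and Gaussian mutation are generic textbook choices, not the DecoGen™ implementation, and every function name is hypothetical.

```python
import random

def encode(hap, common):
    """+1 where the haplotype carries the most common nucleotide, else -1."""
    return [1 if nt == c else -1 for nt, c in zip(hap, common)]

def predict(params, right, left):
    """Assumed linear model: C0 + sum_i (w_i * R_i + w'_i * L_i)."""
    n = len(right)
    c0, w, w_prime = params[0], params[1:1 + n], params[1 + n:]
    return (c0 + sum(a * r for a, r in zip(w, right))
            + sum(b * l for b, l in zip(w_prime, left)))

def error(params, data):
    """Sum of squared differences between measured and predicted outcome."""
    return sum((c - predict(params, r, l)) ** 2 for r, l, c in data)

def genetic_search(data, n_sites, pop_size=40, generations=200,
                   sigma=0.1, seed=0):
    """Return (best parameters, best error recorded at each generation)."""
    rng = random.Random(seed)
    n_params = 1 + 2 * n_sites
    pop = [[rng.uniform(-1, 1) for _ in range(n_params)]
           for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda p: error(p, data))
        history.append(error(pop[0], data))
        parents = pop[: pop_size // 2]        # best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            children.append([g + rng.gauss(0, sigma) for g in child])  # mutation
        pop = parents + children
    pop.sort(key=lambda p: error(p, data))
    return pop[0], history
```

Because the best half of each generation is carried over unchanged, the recorded error never increases from one generation to the next, which is the behavior one would expect from a plot of residual error against generation number.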
[0108] Correlations can also be analyzed using ANOVA techniques to determine how much of the variation in the
clinical data is explained by different subsets of the polymorphic sites in the candidate genes. The DecoGen™ application
has an ANOVA function that uses standard methods to calculate significance (Ref. 4, Chapter 10). An example
of an interface to this tool is shown in FIGURE 23.

[0109] ANOVA is used to test hypotheses about whether a response variable is caused by or correlated with one or
more traits or variables that can be measured. These traits or variables are called the independent variables. To carry
out ANOVA, the independent variable(s) are measured and people are placed into groups or bins based on their values
of the variables. In this case, each group contains those individuals with a given haplotype (or sub-haplotype) pair. The
variation in response within the groups and also the variation between groups is then measured. If the within-group
variation is large (people in a group have a wide range of responses) and the variation between groups is small (the
average responses for all groups are about the same) then it can be concluded that the independent variables used
for the grouping are not causing or correlated with the response variable. For instance, if people are grouped by month
of birth (which should have nothing to do with their response to a drug) the ANOVA calculation should show a low level
of significance. Here, as shown in FIGURE 23, each haplotype-pair group is made up of the individuals in the population
who have that haplotype pair. The table at the bottom shows the number of individuals in the group, the average
response ("Test") of those individuals, and the standard deviation of that response. At the top is a table showing information
comparing the "Between Group" calculation and the "Within Group" calculations. The details are given in the
reference (Ref. 4). If the variation (the "Mean Squares" column) is larger for the "Between Groups" than for the "Within
Groups" set, we will have an F-ratio (= "Between Groups" divided by "Within Groups") greater than one. Large values
of the F-ratio indicate that the independent variable is causing or correlated with the response. The calculated F-ratio
is compared with the critical F-distribution value at whatever level of significance is of interest. If the F-ratio is greater
than the critical F-distribution value, then the user may be confident that the independent variable is predictive at that
level. In this example, the user would see that grouping by haplotype-pair for sites 2 and 8 for CYP2D6 gives
significant probability at the 99% confidence level. The conclusion from this is that an individual's haplotypes at these
positions in this gene are at least partially responsible for or are at least strongly correlated with the value of Test.
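The between-group and within-group comparison can be sketched as a standard one-way ANOVA. The function name `one_way_anova` and the three small groups are hypothetical. The comparison against the critical F-distribution value is left out because it requires an F table or a statistics library.

```python
from statistics import mean

def one_way_anova(groups):
    """One-way ANOVA over lists of responses, one list per group.

    Returns (mean square between, mean square within, F-ratio,
             degrees of freedom between, degrees of freedom within).
    """
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within, df_between, df_within

# Three hypothetical haplotype-pair groups with well separated means.
ms_b, ms_w, f_ratio, df_b, df_w = one_way_anova(
    [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
)
```

Here the group means (2, 3 and 6) differ far more than the values within any group, so the F-ratio is well above one.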
[0110] FIGURE 24 shows a screen which is an example interface to the modeling tool (i.e., the CTS™ Modeler)
described herein. At the right are controls to set the parameters for the genetic algorithm (Ref. 5). In the center is a
graph showing the residual error of the model as a function of the number of genetic algorithm generations. At the
bottom is a bar graph showing the current best weights for Eq. 6. In this example, the linear model described in Eq. 4
is used to find optimal weights for the polymorphic sites. The final parameters arrived at are C_0 = 0.1 and w_{2,CYP2D6} = 0.15
and w'_{8,CYP2D6} = 0.1. This says that the response variable "Test" can be predicted from the formula:

Test = 0.1 + [0.15 × (number of Gs in position 2) + 0.1 × (number of As in position 8)] × 2

where "number" refers to the number in the two haplotypes for an individual.

[0111] 10. Preferably, follow-up trials are designed to test and validate the haplotype-response mathematical model.
[0112] The outcome of Step 9 is a hypothesis that people with certain haplotype pairs or genotypes are more likely
or less likely on average to respond to a treatment. This model is preferably tested directly by running one or more
additional trials to see if this hypothesis holds.
[0113] 11. A diagnostic method is designed (using one or more of haplotyping, genotyping, physical exam, serum
test, etc.) to determine those individuals who will or will not respond to the treatment.

[0114] The final outcome of the CTS™ method is a diagnostic method to indicate whether a patient will or will not
respond to a particular treatment. This diagnostic method can take one of several forms - e.g., a direct DNA test, a
serological test, or a physical exam measurement. The only requirement is that there is a good correlation between
the diagnostic test results and the underlying haplotypes or sub-haplotypes that are in turn correlated with clinical
outcome. In the preferred embodiment, this uses the predictive genotyping method described in item 6.

2. Illustration With ADRB2 Gene

[0115] Figure 26 is the opening screen for the Asthma project. This screen appears after the "Asthma" folder has
been selected from among the projects shown at the left. Selecting a folder causes the genes associated with that
project to become active. Genes known or suspected of being involved in asthma are shown in the screen in "Extracellular"
and "Intracellular" compartments. The text "Active Gene: DAXX" is a default value; "DAXX" will be replaced
with the name of whatever gene is selected from this window. Selecting ADRB2, and then "Geneinfo" from the menu
at left, brings up Figure 27.

[0116] Figure 27 presents data and statistics related to the ADRB2 gene. Selecting "GeneStructure" from the menu
at left brings up Fig. 28A.

[0117] Figure 28A is a screen showing the genomic structure of the ADRB2 gene (showing the location of features
of the gene, such as promoters, exons, introns, 5' and 3' untranslated regions), polymorphism and haplotype information,
and the number of times each haplotype was seen in the representatives of each of 4 world population groups.
The column "Wild" contains the number of individuals homozygous for the more common nucleotide at each polymorphic
site, "Mut" contains the number homozygous for the less common nucleotide, and "Het" is the number of heterozygous
individuals. Overlaid on the two graphical gene representations at the upper part of the screen are vertical
bars, indicating the positions of the polymorphic sites elaborated in the middle box. The user may scroll through the
lower boxes to bring different portions of the polymorphism and haplotype data into view. Selecting row 6 in the middle
window results in Figure 28B.

[0118] Figure 28B is a screen where a particular polymorphic site has been selected in the middle box. The upper
graphical representation of the gene has been replaced by a textual representation, presented as a nucleotide sequence
aligned with the lower graphical representation at the point of the selected polymorphic site (indicated by the
black triangles). At the polymorphic site, the two observed nucleotides (T and C) are displayed. Selecting "Patient
table" from the menu at left brings up Fig. 29A.

[0119] Figure 29A presents genealogical information and diplotype and haplotype data for individuals within the
database. Shaded rectangles within the table represent missing data. Within the rectangles and ovals are the ID numbers
of the individuals; below each of these in the upper genealogical chart are the two haplotypes of the ADRB2 gene
present in that individual, identified by number. The nucleotides comprising these haplotypes are displayed in the box
at the lower right. Selecting "Clinical Trial Data" from the menu at left brings up Fig. 29B.

[0120] Figure 29B presents the clinical data sorted by individual patient. Severity scores, Skin Test results, and the
clinically measured parameters described elsewhere are set out in columns. "NP" stands for "No data Point", and
represents data missing for any reason. Selecting "HAPSNP" from the menu at left brings up Fig. 30.
[0121] Figure 30 presents, for each patient, a row of color-coded (or shaded) squares representing the heterozygosity
of the patient at each polymorphic site. These are adjacent to a row of split squares, where the same information is
presented in a two-color (or shaded) format. Selecting the HAPPair command from the menu at the left brings up Fig. 31.

[0122] Figure 31 presents the "HAP Pair Frequency View" in which the world population distribution of haplotype or
sub-haplotype pairs can be investigated. In this window, polymorphic sites 3, 9, and 11 have been selected by checking
the corresponding boxes above the haplotypes. Each cell in the matrix below corresponds to a haplotype pair identified
by the HAP numbers on the x and y axes. The height of the color-coded (or shaded) bars within each cell corresponds
to the number of individuals of each population group having that haplotype pair. Clicking on the V/D button at the top
of the screen toggles between Fig. 31 and 32.

[0123] Figure 32 shows the same data in tabular form. In this figure all SNPs have been selected, so the haplotypes
being evaluated consist of thirteen polymorphic sites. Each row in the table corresponds to a haplotype pair (the two
haplotypes which comprise the pair are identified in the first two columns), followed by the number of individuals in the
database having that pair, and the percentage of the total population this number represents. Under each population
group are three columns presenting the number of individuals in the population group with that pair, the percentage of the
population group that has that pair, and the percentage predicted by Hardy-Weinberg equilibrium. Selecting "Linkage"
from the menu at left brings up Fig. 33.
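The Hardy-Weinberg prediction for each haplotype pair can be sketched as follows. It assumes that haplotype frequencies are estimated by counting haplotypes in the observed pairs; the function name `hardy_weinberg_pairs` and the four sample individuals are hypothetical.

```python
from collections import Counter

def hardy_weinberg_pairs(observed_pairs):
    """Expected haplotype-pair fractions under Hardy-Weinberg equilibrium.

    observed_pairs: list of (hap_a, hap_b) tuples, one per individual.
    Returns {(hap_low, hap_high): expected fraction of the population}.
    """
    hap_counts = Counter(h for pair in observed_pairs for h in pair)
    total = sum(hap_counts.values())
    freq = {h: c / total for h, c in hap_counts.items()}
    haps = sorted(freq)
    expected = {}
    for i, a in enumerate(haps):
        for b in haps[i:]:
            # Homozygous pairs occur with p^2, heterozygous pairs with 2pq.
            expected[(a, b)] = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
    return expected

# Four hypothetical individuals carrying haplotypes 1 and 2.
expected = hardy_weinberg_pairs([(1, 1), (1, 2), (2, 2), (1, 1)])
```

Multiplying each fraction by 100 gives a percentage comparable with the observed percentage for the same population group.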

[0124] Figure 33 displays separate matrices for the total population and for each population group. Each cell is color-coded
(or shaded) to indicate the extent to which the two haplotypes occur together in individuals, i.e., the degree to
which they are linked. Selecting "HAPTyping" from the menu at left brings up the screen in Fig. 34.

[0125] Figure 34 presents the ambiguity scores that result from masking one or more SNPs or polymorphisms in the
genotype. The ambiguity scores are calculated by taking the sum of the geometric means of all pairs of genotypes
rendered ambiguous by the mask, and multiplying by ten. All population groups have been chosen for inclusion in this
figure by checking off the boxes at the upper left of the screen. The list of haplotype pairs has been sorted by the
calculated Hardy-Weinberg frequency, and the pairs have been numbered consecutively, as shown in the first column.
[0126] A mask that causes SNP 8 to be ignored in all cases has been imposed by deselecting the appropriate box
in the "Choose SNP" row above the haplotype list. Additional masking has been imposed by deselecting the appropriate
boxes in the mask to the right of the Genotype table. (The mask is to the right of the table and may be accessed by
scrolling horizontally; in the figure it has been relocated to bring it into view.) In the first mask, only SNP 8 is ignored,
which results in haplotype pairs 4 and 73 both being consistent with the genotype observed. (In other words, the
genotypes derived from haplotype pairs 4 and 73 differ only at SNP 8, and cannot be distinguished if it is not measured.)
An ambiguity score of 0.016 is associated with this first mask. The frequency of haplotype pair 4 is much greater than
that of haplotype pair 73 (recall that the list is sorted by frequency), so one could resolve this ambiguity with some
confidence simply by choosing haplotype pair 4. (In an alternative embodiment, the probability of each choice being
the correct one could be displayed.) For the present application, in general, the mask with the largest number of ignored
SNPs that retains an ambiguity score of about 1.0 or less will be preferred. The ambiguity score cut-off that is chosen
may vary depending on the intended use of the inferred haplotypes. For example, if haplotype pair information is to be
used in prescribing a drug, and certain haplotype pairs are associated with severe side effects, the acceptable ambiguity
score may be reduced. In such a situation masks that do not render the haplotype pairs of interest ambiguous would
be preferred as well. Selecting "Phylogenetic" from the menu at left brings up Fig. 35.
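One possible reading of the ambiguity score is sketched below. It rests on two assumptions that the text does not spell out: that the geometric mean is taken over the frequencies of the two haplotype pairs concerned, and that a pair of genotypes counts as "rendered ambiguous by the mask" when the genotypes differ with every site measured but coincide once the masked sites are ignored. The function names and the three haplotype pairs are hypothetical.

```python
from itertools import combinations
from math import sqrt

def genotype(pair, mask):
    """Unphased genotype of a haplotype pair at the measured sites.

    mask[i] is True when site i is measured and False when it is ignored.
    """
    hap_a, hap_b = pair
    return tuple(tuple(sorted((a, b)))
                 for a, b, keep in zip(hap_a, hap_b, mask) if keep)

def ambiguity_score(pair_freqs, mask):
    """Ten times the sum of sqrt(f1 * f2) over newly ambiguous pairs."""
    full = [True] * len(mask)
    score = 0.0
    for (p, fp), (q, fq) in combinations(pair_freqs.items(), 2):
        distinct_unmasked = genotype(p, full) != genotype(q, full)
        same_when_masked = genotype(p, mask) == genotype(q, mask)
        if distinct_unmasked and same_when_masked:
            score += sqrt(fp * fq)
    return 10 * score

# Hypothetical two-site haplotype pairs with assumed frequencies.
pair_freqs = {("AC", "AC"): 0.64, ("AC", "AT"): 0.04, ("GC", "GC"): 0.32}
score = ambiguity_score(pair_freqs, [True, False])  # second site ignored
```

Under this reading, a mask that hides a site separating a common pair from a rare pair contributes little to the score, which matches the observation that such ambiguities can be resolved with some confidence by choosing the more frequent pair.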

[0127] Figure 35 presents haplotype data in a phylogenetic minimal spanning network. Each disk corresponds to a
haplotype; the haplotype number is to the immediate right of each disk. The size of each disk is proportional to the
number of individuals having that haplotype; that number is displayed in parentheses to the right of each disk. Haplotypes
that are closely related, that is, they differ at only one polymorphic site, are connected by solid lines. Haplotypes
that differ at two sites are connected by light lines, and are spaced farther apart. The colored (or shaded) wedges
represent the fraction of individuals having that haplotype that are from different population groups. Selecting "Clinical
Haplotype Correlation" brings up the screen in Fig. 36.

[0128] Figure 36 presents the association between a clinical outcome value (in this case, "delta %FEV1 pred", which
is the change in FEV1 observed after administration of albuterol, corrected for size, age, and gender) and haplotype
pair. The SNPs one wishes to test for association may be selected by checking off the appropriate box above the HAP
list table. The value of delta %FEV1 is represented in grayscale or by a color scale. Each cell in the matrix corresponds
to a given haplotype pair, defined by the haplotype numbers on the x and y axes. The number in each cell is the number
of patients having that haplotype pair, and the color (or shading) of each cell reflects the response of those patients to
albuterol. In this case, groups of people with haplotype pairs shown in the red (or darkly shaded) boxes have the highest
average response, e.g. haplotype pairs 3,4 and 3,5. (See also Fig. 41, which presents numerical results showing that
individuals with these haplotype pairs have a high average response to albuterol.) Under the "Clinical Mode" menu
heading at the top of the screen is a command that the user may use to toggle among Figs. 36, 37, 38, and 40.

[0129] Switching to Fig. 37 in this manner displays a collection of histograms, one in each cell of a haplotype pair
matrix. Selecting the 1,1 cell enlarges it, bringing up Fig. 38.

[0130] Figure 38 is a histogram showing the number of individuals having the 1,1 haplotype pair who exhibited the
response to albuterol shown on the x axis. The bars in the histogram are color-coded (or shaded) as well, as an
additional indication of the degree of response.

[0131] In either Fig. 36 or Fig. 37, there is a button with an icon of a small scatter plot (just below the Help menu at
the top of the screen). Selecting this button brings up Fig. 39A. This figure displays the regression calculations employed
in the multi-SNP analysis, or "Build-up" process. Given the confidence values shown, which are the default values for
the "tight cutoff" and "loose cutoff", the program generates pairwise combinations of SNPs, tests their p-values for
correlation with "delta %FEV1 pred" against the cutoff values, and, from those subhaplotypes that pass the cut-offs,
re-calculates and tests new pairwise combinations, until the number of SNPs in the subhaplotypes reaches the limit
shown in the "Fixed Site" box. In the example shown, no four-SNP subhaplotype passed the loose cutoff, thus there
are only 1-, 2-, and 3-SNP sub-haplotypes shown in this screen. New values may be entered in the Confidence and
Fixed Site fields; clicking on the calculator button (under the File menu) re-executes the Build-up and Build-down processes
with the entered values.

[0132] A reverse SNP analysis, or "Build-down" process, may also be carried out; the presence of the minus sign in
the "Fixed Site" box indicates that this process is being requested. (In the example given, only a single "Build-down"
round was executed, so as to ensure that the full haplotype is present for comparison.)

[0133] For each "marker" (SNP, subhaplotype, or haplotype) in the left column, a regression analysis of the correlation
of the number of copies of that marker with the value of "delta %FEV1 pred" is generated, and selected statistical
information is presented in the columns to the right. (A negative correlation coefficient (R) indicates that response to
albuterol decreases with increasing copy number of the indicated marker.) The SNPs or subhaplotypes exhibiting the
lowest p values are identified as the ones that should most preferably be measured in patients in order to predict
response to albuterol. Selecting the box to the left of the **A*****A*G** sub-haplotype brings up Fig. 39B.
[0134] Figure 39B presents in a graphic form the calculation of the regression parameters displayed in Fig. 39A. The
values of "delta %FEV1 pred" for patients with 0, 1, and 2 copies of the **A*****A*G** subhaplotype are plotted vertically
at three ordinates. A line is drawn through the three means, and the slope of the line is taken as an indication of the
degree of correlation. The intercept, slope, slope range, R and R2 values, and the p value associated with this line
are all listed in Fig. 39A. The "slope range" is a pair of limits, reflecting the standard deviation in the values of "delta
%FEV1 pred". Mathematically, the p value listed in Fig. 39A is the probability that the slope is actually zero, i.e. it is
the probability that there is in fact no correlation. A lower value of p thus indicates greater reliability.
[0135] Fig. 40 (reached through the "Clinical Mode" menu) displays the observed haplotype pairs, their distribution
in the population, and the mean clinical response (delta %FEV1 pred.) of the patients having those haplotype pairs.
Selecting the "normal" button (to the right of the scatter plot button) brings up Fig. 41.

[0136] Figure 41 shows a screen that displays the results of an ANOVA calculation in which patients were grouped
according to haplotype pairs, and the average value of "delta %FEV1 pred." was analyzed both within the groups and
between the groups. This permits one to determine which pairs of haplotypes are associated with the observed clinical
response. All SNPs in the ADBR2 gene have been selected in the row of boxes labeled "Choose SNPs", thus the
groups are the same as the cells in the matrix in Fig. 36. Groups containing one patient were ignored, leaving the
seven groups listed at the bottom of the screen. This left six degrees of freedom (the parameter "DF") for inter-group
comparisons. The variation ("Mean Squares") is larger between groups than within groups, and the ratio of the two
(F-ratio) is greater than one. (A large F-ratio indicates that the independent variable, the haplotype pair group, is
correlated with the response.) There is a significant difference (p = 0.027) between the mean square value of the clinical
response between groups compared to that within groups. It is found in this example that being homozygous for haplotype 3
results in a significantly lower response (average 8.5%), while individuals with haplotype pair 3,4 (i.e., GCACCTTTACGCC
and GCGCCTTTGCACA) show a good response to albuterol (average delta %FEV1 pred = 19.25%). This
information is displayed in a more visual presentation in Fig. 36.
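The ANOVA quantities named above (degrees of freedom, mean squares, F-ratio) can be sketched as follows. This is a minimal one-way ANOVA under the assumption that groups with a single patient are dropped first, as the text describes. The function name and the response values are hypothetical, and no p value is computed.

```python
def one_way_anova(groups):
    """One-way ANOVA over a dict mapping haplotype pair -> list of responses.

    Groups containing a single patient are ignored.
    Returns (df_between, df_within, ms_between, ms_within, f_ratio).
    """
    kept = {k: v for k, v in groups.items() if len(v) > 1}
    values = [x for g in kept.values() for x in g]
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for g in kept.values():
        mean = sum(g) / len(g)
        ss_between += len(g) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in g)
    df_between = len(kept) - 1          # e.g. seven groups leave six degrees of freedom
    df_within = len(values) - len(kept)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return df_between, df_within, ms_between, ms_within, ms_between / ms_within


# Hypothetical "delta %FEV1 pred" values grouped by haplotype pair.
groups = {"3,3": [8.0, 9.0], "3,4": [18.0, 19.0, 20.0], "1,2": [12.0]}
df_b, df_w, ms_b, ms_w, f_ratio = one_way_anova(groups)
```

An F-ratio well above one, as in this made-up data, indicates that the variation between haplotype-pair groups exceeds the variation within them.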

[0137] Figure 42 is arrived at by selecting the "ClinicalVariables" command from the menu to the left of most of the
previous screens. This is the same information displayed in Fig. 38, except that it is for the entire cohort rather than
for a selected haplotype pair. The number of patients is plotted against the value of "delta %FEV1 pred". Note the
outliers at 50% and 65% response. Selecting "ClinicalCorrelations" from the menu to the left brings up Fig. 43.
[0138] Figure 43 is a plot of each patient's "FEV1% PRE" (the normalized value of FEV1 prior to administration of
albuterol) against "delta %FEV1 pred". These variables are selected in the upper part of the screen. It is seen in this
example that the response does not correlate with the initial value of FEV1.

D. IMPROVED METHODS 

1. Improved Method For Finding Optimal Genotyping Sites

[0139] This aspect of the invention provides a method for determining an individual person's haplotypes for any gene
with reduced cost and effort. A haplotype is the specific form of the gene that the individual inherited from either mother
or father. The 2 copies of the gene (one maternal and one paternal) usually differ at a few positions in the DNA locus
of the gene. These positions are called polymorphisms or Single Nucleotide Polymorphisms (SNPs). The minimal
information required to specify the haplotype is the reference sequence, the set of sites where differences occur
among people in a population, and the nucleotides at those sites for a given copy of the gene possessed by the individual.
For the rest of this discussion, we assume that the reference sequence is given, and we represent the haplotype as a
string of letters specifying the nucleotides at the variable sites. In almost all cases, only two of the possible 4 nucleotides
will occur at any position (e.g., A or T, C or G), so for generality we can represent the two values for alleles as 1 and 0.
Therefore a haplotype can be represented as a string of 1s and 0s such as 001010100. In practicing this invention,
one may make use of known methods for discovering a representative set of the haplotypes that exist in a population,
as well as their frequencies. One begins by sequencing large sections of the gene locus in a representative set of
members in the population. This provides (1) a determination of all of the sites of variation, and (2) the mixed (unphased)
genotype for each individual at each site. For instance, in a sample of 4 individuals for a gene with 3 variable sites, the
mixed genotypes could be:



Individual   Genotype site 1   Genotype site 2   Genotype site 3   Haplotype of 1st allele   Haplotype of 2nd allele
1            1/1               1/0               1/0               3                         4
2            0/0               0/0               0/0               1                         1
3            1/0               1/0               0/0               1                         2
4            1/1               0/0               1/0               3                         5



[0140] This mixed set of genotypes could be derived from the following haplotypes: 



Haplotype No.   Haplotype   Frequency in population
1               000         3
2               110         1
3               100         2
4               111         1
5               101         1
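The relationship between the two tables above can be checked with a short sketch: each individual's mixed genotype follows from combining two haplotype strings site by site, and the haplotype counts follow from tallying the pairs. The helper name `mixed_genotype` is hypothetical. The pair assigned to individual 1 is the only one consistent with the haplotype counts in the frequency table, and is treated here as an inferred value.

```python
from collections import Counter

# Haplotype number -> allele string over the 3 variable sites.
HAPLOTYPES = {1: "000", 2: "110", 3: "100", 4: "111", 5: "101"}

# Individual -> (haplotype of 1st allele, haplotype of 2nd allele).
# The pair for individual 1 is inferred from the haplotype counts.
PAIRS = {1: (3, 4), 2: (1, 1), 3: (1, 2), 4: (3, 5)}


def mixed_genotype(hap_a, hap_b):
    """Unphased genotype per site, written with the 1 allele first (e.g. '1/0')."""
    return ["/".join(sorted((a, b), reverse=True)) for a, b in zip(hap_a, hap_b)]


genotypes = {
    ind: mixed_genotype(HAPLOTYPES[a], HAPLOTYPES[b]) for ind, (a, b) in PAIRS.items()
}
counts = Counter(h for pair in PAIRS.values() for h in pair)
```

The eight alleles carried by four individuals account for the frequencies 3, 1, 2, 1 and 1 listed for haplotypes 1 to 5.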



[0141] A method for deriving the haplotypes from the genotypes is described in a separate patent filing.
[0142] The haplotypes are a fundamental unit of human evolution and their relationships can be described in terms
of phylogenetics. One consequence of this phylogenetic relationship is the property of linkage disequilibrium. Basically
this means that if one measures a nucleotide at one site in a haplotype, one can often predict the nucleotide that will
exist at another site without having to measure it. This predictability is the basis of this aspect of the invention. Elimination
of sites that do not need to be measured results in a reduced set of sites to be measured.
[0143] Information from a previously measured set of individuals (who were measured at all sites) may be used to
determine the minimum number (or a reduced number) of sites that need to be measured in a new individual in order
to predict the new individual's haplotypes with a desired level of confidence. Since the measurement at each site is
expensive, the invention can lead to great cost reduction in the haplotyping process.
[0144] Step 1: Measure the full genotypes of a representative cohort of individuals.
[0145] Step 2: Determine their haplotypes directly, or indirectly (e.g., using one of several algorithms).
[0146] Step 3: Tabulate the frequencies for each of these haplotypes.

[0147] Note that Steps 1-3 are optional. The remaining steps only require that a database of haplotypes with frequencies
exists. There are several ways to achieve this, but the above set of steps is the preferred route.
[0148] Step 4: Construct the list of all full genotypes that could come from the observed haplotypes. Note that only
a subset of these will actually be observed in a typical sample, for example 100-200 individuals.
[0149] Step 5: Predict the frequency of these genotypes from the Hardy-Weinberg equilibrium. If two haplotypes
Hap1 and Hap2 have frequencies f1 and f2, the expected frequency of the mix is 2 x f1 x f2, or f1 x f2 if Hap1 and
Hap2 are identical.
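Steps 4 and 5 can be sketched together: enumerate every unordered pair of haplotypes and assign each the Hardy-Weinberg expectation stated above. The function name and the example frequencies are hypothetical.

```python
from itertools import combinations_with_replacement


def expected_pair_frequencies(hap_freqs):
    """Map each unordered haplotype pair to its Hardy-Weinberg expected frequency.

    hap_freqs: dict mapping haplotype label -> population frequency (summing to 1).
    """
    expected = {}
    for a, b in combinations_with_replacement(sorted(hap_freqs), 2):
        product = hap_freqs[a] * hap_freqs[b]
        # f1 x f2 for identical haplotypes, 2 x f1 x f2 otherwise.
        expected[(a, b)] = product if a == b else 2 * product
    return expected


# Hypothetical haplotype frequencies.
pair_freqs = expected_pair_frequencies({"A": 0.5, "B": 0.3, "C": 0.2})
```

With three haplotypes this yields six pairs whose expected frequencies sum to one.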

[0150] Step 6: Go through this list and find all sites that, if they were not measured, would still allow one to correctly 
determine each pair of haplotypes. For example, take the case where the three haplotypes A (1111), B (1110), and C 
(0000) exist in a population. The six genotypes that could be observed are derived from the six different pairs that are 
possible: 





      Hap     Polymorphic Site
      Pair    1      2      3      4
1.    A,A     1/1    1/1    1/1    1/1
2.    A,B     1/1    1/1    1/1    1/0
3.    A,C     1/0    1/0    1/0    1/0
4.    B,B     1/1    1/1    1/1    0/0
5.    B,C     1/0    1/0    1/0    0/0
6.    C,C     0/0    0/0    0/0    0/0



[0151] Not measuring any one of the sites 1-3 would still permit one to correctly assign a haplotype pair to an individual.
From this we can see that any one of the first three positions, together with the fourth, carries all of the information
required to determine which pair of haplotypes an individual has.
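The check in Step 6 can be sketched for the three-haplotype example: a site may be skipped if, with that site removed, all six haplotype pairs still produce distinct genotypes. The function names are hypothetical.

```python
from itertools import combinations_with_replacement

HAPS = {"A": "1111", "B": "1110", "C": "0000"}


def genotype(h1, h2):
    """Unphased genotype: at each site, the two alleles in sorted order."""
    return tuple("".join(sorted(a + b)) for a, b in zip(h1, h2))


def droppable_sites(haps):
    """1-based site numbers that can be left unmeasured without merging any two pairs."""
    pairs = list(combinations_with_replacement(sorted(haps), 2))
    n_sites = len(next(iter(haps.values())))
    result = []
    for skip in range(n_sites):
        reduced = set()
        for a, b in pairs:
            g = genotype(haps[a], haps[b])
            reduced.add(g[:skip] + g[skip + 1:])
        if len(reduced) == len(pairs):  # every pair is still distinguishable
            result.append(skip + 1)
    return result


sites = droppable_sites(HAPS)
```

For this example the result is sites 1, 2 and 3, while site 4 cannot be dropped because pairs A,A and A,B would become indistinguishable.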

[0152] Step 7: Extend the analysis of Step 6 as follows. Create a set of masks of the same length as the haplotype.
A mask may be represented by a series of letters, e.g., Y for yes and N for no, to indicate whether the marked site is
to be measured. For example, using the mask YNNY in the previous example, one would measure only sites 1 and 4,
and one could use the information that only haplotypes 1111, 1110, and 0000 exist to infer the haplotypes for the
individuals. Masks NYNY and NNYY would give equivalent information. If there are n sites, all combinations of Y and
N produce 2^n masks, of which 2^n - 1 need to be examined (the all-N mask provides no information).
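Mask generation as described in Step 7 can be sketched in a few lines. The function names `all_masks` and `apply_mask` are hypothetical.

```python
from itertools import product


def all_masks(n_sites):
    """Every Y/N mask of the given length except the all-N mask (2^n - 1 masks)."""
    return ["".join(m) for m in product("YN", repeat=n_sites) if "Y" in m]


def apply_mask(genotype, mask):
    """Keep only the genotype entries at sites marked Y."""
    return tuple(g for g, m in zip(genotype, mask) if m == "Y")


masks = all_masks(4)
# Pair B,C from the example, measured only at sites 1 and 4.
measured = apply_mask(("1/0", "1/0", "1/0", "0/0"), "YNNY")
```

Exhaustive enumeration grows as 2^n, so this direct approach is practical only for genes with a modest number of polymorphic sites.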

[0153] Step 8: For each mask, evaluate how much ambiguity exists from this measurement of incomplete information.
For example, one measure of ambiguity would be to take all pairs of genotypes that are identical when using the mask,
and multiply their frequencies. The product may be converted to the geometric mean. Then, for each mask, add up all
such products for all ambiguous pairs to obtain an ambiguity score, which is used as a penalty factor in evaluating the
value of the mask. The consequence of this would be to highly penalize masks that fail to resolve likely-to-be-seen
genotypes into correct haplotypes, and masks that leave large numbers of genotypes ambiguous, such as the mask
NNNY in the above example. This would give greater weight to masks that only confuse low frequency, low probability
genotypes. A variety of other scoring schemes could be devised for this purpose.
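One possible reading of the ambiguity score in Step 8 is sketched below. It assumes that the items being compared are haplotype pairs weighted by their Hardy-Weinberg frequencies, that two pairs are ambiguous when their masked genotypes are identical, and that each ambiguous pair contributes the geometric mean of the two frequencies. The function names and haplotype frequencies are hypothetical, and other scoring schemes are equally admissible.

```python
from itertools import combinations, combinations_with_replacement
from math import sqrt


def masked_genotype(h1, h2, mask):
    """Unphased genotype of two haplotypes, restricted to sites marked Y."""
    return tuple("".join(sorted(a + b)) for a, b, m in zip(h1, h2, mask) if m == "Y")


def ambiguity_score(hap_freqs, mask):
    """Sum of geometric means of frequencies over all haplotype pairs the mask confuses."""
    pairs = []
    for h1, h2 in combinations_with_replacement(sorted(hap_freqs), 2):
        freq = hap_freqs[h1] * hap_freqs[h2] * (1 if h1 == h2 else 2)
        pairs.append((masked_genotype(h1, h2, mask), freq))
    return sum(
        sqrt(f1 * f2)
        for (g1, f1), (g2, f2) in combinations(pairs, 2)
        if g1 == g2
    )


# Haplotypes from the example above with hypothetical frequencies.
freqs = {"1111": 0.5, "1110": 0.3, "0000": 0.2}
score_two_sites = ambiguity_score(freqs, "YNNY")
score_one_site = ambiguity_score(freqs, "NNNY")
```

Under these assumptions the mask YNNY scores zero, while NNNY is penalized because it confuses several common pairs.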

[0154] This approach is most preferably implemented by means of a computer program that allows a user to view
the ambiguity score for each mask, and calculate the tradeoff between reduced cost and reduced certainty in the
determination of the haplotypes.

[0155] Step 8: Genotype new individuals using the optimal set of m sites (the optimal mask). In the example above, 
there are three equivalent optimal masks, YNNY, NYNY and NNYY, which require that only two of the four polymorphic 
sites be measured. (These masks have zero ambiguity.) 
[0156] Step 9: Derive these individuals' full n-site haplotypes by matching their m-site genotypes to the appropriate
m-site genotypes derived from the n-site haplotypes of the initial cohort. If there is an ambiguity in the choice, the more
common haplotype may be chosen, but preferably a haplotype pair will be chosen based on a weighted probability
method as follows:

If two haplotype pairs A and B exist that could explain a given genotype, the Hardy-Weinberg equilibrium will
predict probabilities p_A and p_B, where p_A + p_B = 1. One chooses a random number between 0 and 1. If the number is
less than or equal to p_A, the first haplotype pair A is assumed. If the number is greater than p_A, the second pair is
assumed. There are more complex variants of this algorithm, but this simple, unbiased approach is preferred.
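The weighted random choice described above can be sketched as follows, generalized to any number of candidate pairs. It assumes the candidate frequencies are normalized so that the probabilities sum to 1. The function name, the `rng` parameter and the example frequencies are hypothetical.

```python
import random


def choose_pair(candidates, rng=random):
    """Pick one haplotype pair with probability proportional to its expected frequency.

    candidates: list of (pair, hardy_weinberg_frequency) that all explain the genotype.
    rng: any object with a random() method returning a float in [0, 1).
    """
    total = sum(freq for _, freq in candidates)
    draw = rng.random()
    cumulative = 0.0
    for pair, freq in candidates:
        cumulative += freq / total  # normalized so the probabilities sum to 1
        if draw <= cumulative:
            return pair
    return candidates[-1][0]  # guard against floating-point round-off


# Two pairs that give the same two-site genotype (heterozygous at both sites).
candidates = [(("00", "11"), 0.24), (("01", "10"), 0.06)]
chosen = choose_pair(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the assignment reproducible.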

2. Improved Methods For Correlating Haplotypes With Clinical Outcome Variable(s)

[0157] The following methods are described for correlating haplotypes, or haplotype pairs, with a clinical outcome
variable. However, these methods are applicable to correlating haplotypes, and/or haplotype pairs, to any phenotype
of interest, and are not limited to a clinical population or to applications in a clinical setting.

a. Multi-SNP Analysis Method (Build-Up Process)

[0158] This process is outlined in the flow chart shown in Figure 45. The first step (S1) is the collection of haplotype
information and clinical data from a cohort of subjects. Clinical data may be acquired before, during, or after collection
of the haplotype information. The clinical data may be the diagnosis of a disease state, a response to an administered
drug, a side-effect of an administered drug, or other manifestation of a phenotype of interest for which the practitioner
desires to determine correlated haplotypes. The data is referred to as "clinical outcome values." These values may be
binary (e.g., response/no response, survival at 5 months, toxicity/no toxicity, etc.) or may be continuous (e.g., liver
enzyme levels, serum concentrations, drug half-life, etc.).

[0159] The collection of haplotype information is the determination (e.g., by direct sequencing or by statistical inference)
of a pattern of SNPs for each allele of a pre-selected gene or group of genes, for each individual in the cohort.
The gene or group of genes selected may be chosen based on any criteria the practitioner desires to employ. For
example, if the haplotype data is being collected in order to build a general-purpose haplotype database, a large number
of clinically and pharmacologically relevant genes are likely to be selected. Where a retrospective analysis of a cohort
from an ongoing or completed clinical study is being carried out, a smaller number of genes judged to be relevant might
be selected.

[0160] The next step (S2) is the finding of single SNP correlations. Each individual SNP is statistically analyzed for
the degree to which it correlates with the phenotype of interest. The analysis may be any of several types, such as a
regression analysis (correlating the number of occurrences of the SNP in the subject's genome, i.e., 0, 1, or 2, with the
value of the clinical measurement), ANOVA analysis (correlating a continuous clinical outcome value with the presence
of the SNP, relative to the outcome value of individuals lacking the SNP), or case-control chi-square analysis (correlating
a binary clinical outcome value with the presence of the SNP, relative to the outcome value of individuals lacking the
SNP).

[0161] In one embodiment, a "tight cut-off" criterion is next applied to each SNP in turn. A first SNP is selected (S3)
and its correlation with the clinical outcome is tested against a tight cut-off (S4). A typical value for the tight cut-off will
be in the range p = .01 to .05, although other values may be chosen on empirical or theoretical grounds. If the SNP
correlation meets the tight cut-off it is displayed to the user of the system (S5) (or, alternatively, stored for later display),
and stored for later combination (S6). If the SNP correlation does not meet the tight cut-off it is tested against a "loose
cut-off" (S7), typically in the range p = .05 to 0.1. Again, other cut-off values may be chosen if desired for any reason.
(User-selected tight and loose cut-off values are entered in the two boxes labeled "confidence" in Fig. 39A.) A SNP
whose correlation meets the loose cut-off is stored for later combination (S6). Any SNP whose correlation does not
meet either cut-off is discarded (S8), i.e., it is not considered further in the process. If there are SNPs remaining to be
tested against the cut-offs (S9) they are selected (S10) and tested (S4) in turn.
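The two-threshold screening of steps S3 to S10 can be sketched as a simple triage over p values. The sketch assumes that a correlation "meets" a cut-off when its p value is less than or equal to it. The function name, marker labels and p values are hypothetical.

```python
def triage(p_values, tight=0.05, loose=0.1):
    """Split markers into those displayed and those saved for later combination.

    p_values: dict mapping marker name -> p value of its correlation.
    Markers meeting the tight cut-off are displayed and saved, markers meeting
    only the loose cut-off are saved, and all others are discarded.
    """
    displayed, saved = [], []
    for marker, p in p_values.items():
        if p <= tight:
            displayed.append(marker)
            saved.append(marker)
        elif p <= loose:
            saved.append(marker)
    return displayed, saved


displayed, saved = triage({"SNP1": 0.01, "SNP2": 0.08, "SNP3": 0.40})
```

Keeping loosely correlated markers allows them to contribute to combinations whose joint correlation may be stronger than that of any single marker.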

[0162] In an alternative embodiment, a tight cut-off is not applied, and each SNP's correlation is tested directly against
the loose cut-off, and the SNP is either saved or discarded. In this embodiment, correlations of pair-wise generated
sub-haplotypes (see below) are also tested directly against the loose cut-off. If desired, SNPs and sub-haplotypes
which are saved at the end of this alternative process may be measured against a tight cut-off, and those that pass
may be displayed.

[0163] When all SNPs have had their correlations tested, the next step of the process consists of generating all
possible pair-wise combinations (sub-haplotypes) of the saved SNPs. If novel (i.e., untested) sub-haplotypes are possible
(S11), which will be the case on the first iteration, they are generated by pair-wise combination of all saved SNPs
(S12). The correlations of the newly generated sub-haplotypes with the clinical outcome values are calculated (S13),
as was done for the SNPs. A first sub-haplotype is selected (S15) and its correlation is tested against the tight and
loose cut-offs (S4, S7) as described above for the SNP correlations. Each sub-haplotype is tested in turn, as described
above, discarding any sub-haplotypes that do not pass the cut-off criteria and saving those that do pass.
[0164] When all sub-haplotypes have been examined, the process generates new pair-wise combinations among
the originally saved SNPs and the newly saved sub-haplotypes, and among all saved sub-haplotypes as well. The
process may be iterated until no new combinations are being generated; alternatively the practitioner may interrupt
the process at any time. In a preferred embodiment, the practitioner may set a limit to the number of SNPs permitted
in the generated sub-haplotypes. (See Fig. 39A, where "fixed site = 4" is a 4-SNP limit.) In this embodiment the system
would then determine if new combinations within the limit are possible prior to each pair-wise combination step.
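The pair-wise combination step of the build-up process can be sketched as below. The sketch assumes a sub-haplotype is represented as a mapping from site number to allele, that two sub-haplotypes can be merged only if they agree at any shared site, and that a limit on the number of SNPs is enforced. These representation choices and the function names are hypothetical. Scoring the new combinations against the cut-offs is left out.

```python
from itertools import combinations


def combine(sub_a, sub_b):
    """Merge two sub-haplotypes (dicts of site -> allele); None if they conflict."""
    for site in sub_a.keys() & sub_b.keys():
        if sub_a[site] != sub_b[site]:
            return None
    return {**sub_a, **sub_b}


def next_generation(saved, max_snps):
    """All novel pair-wise combinations of saved items with at most max_snps sites."""
    seen = {frozenset(s.items()) for s in saved}
    novel = []
    for a, b in combinations(saved, 2):
        merged = combine(a, b)
        if merged is None or len(merged) > max_snps:
            continue
        key = frozenset(merged.items())
        if key not in seen:  # skip combinations that were already generated
            seen.add(key)
            novel.append(merged)
    return novel


# Three saved single-SNP markers (site number -> allele).
saved = [{3: "A"}, {9: "A"}, {11: "G"}]
pairs = next_generation(saved, max_snps=4)
triples = next_generation(saved + pairs, max_snps=4)
```

Iterating `next_generation` until it returns an empty list corresponds to running the process until no new combinations are generated.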
[0165] In a preferred embodiment, complex redundant sub-haplotypes are removed from the pair-wise generated
sub-haplotypes (S14). Complex redundant sub-haplotypes are those which are constructed from smaller sub-haplotypes,
where the smaller sub-haplotypes have correlation values that are at least as significant as that of the complex
sub-haplotype, i.e., they have correlation values that account for the correlation value of the complex redundant
sub-haplotype. In such cases the complex haplotype provides no additional information beyond what the component
sub-haplotypes provide, which makes it redundant. The non-redundant haplotypes and sub-haplotypes that remain are
those that have the strongest association with the clinical outcome values. These are saved for future use (S16).

b. Reverse SNP Analysis Method (Pare-Down Process)

[0166] This aspect of the invention provides a method for discovering which particular SNPs or sub-haplotypes correlate
with a phenotype of interest, when one has in hand single gene haplotype correlation values. The process is
outlined in the flow chart illustrated in Fig. 46.

[0167] The first step (S17) is the collection of haplotype information and clinical data from a cohort of subjects. Clinical
data may be acquired before, during, or after collection of the haplotype information. The clinical data may be the
diagnosis of a disease state, a response to an administered drug, a side-effect of an administered drug, or other
manifestation of a phenotype of interest for which the practitioner desires to determine correlated haplotypes. The data
is referred to as "clinical outcome values." These values may be binary (e.g., response/no response, survival at 5
months, toxicity/no toxicity, etc.) or may be continuous (e.g., liver enzyme levels, serum concentrations, drug half-life,
etc.).

[0168] The collection of haplotype information is the determination (e.g., by direct sequencing or by statistical inference)
of a pattern of SNPs for each allele of each of a pre-selected group of genes, for each individual in the cohort.
The group of genes selected may be chosen based on any criteria the practitioner desires to employ. For example, if
the haplotype data is being collected in order to build a general-purpose haplotype database, a large number of clinically
and pharmacologically relevant genes are likely to be selected. Where a retrospective analysis of a cohort from an
ongoing or completed clinical study is being carried out, a smaller number of genes judged to be relevant might be
selected.

[0169] The next step (S18) is the finding of single-gene haplotype correlations. Each individual haplotype of each
gene is statistically analyzed for the degree to which it correlates with the phenotype or clinical outcome value of
interest. The analysis may be any of several types, such as a regression analysis (correlating the number of occurrences
of the haplotype in the subject's genome, i.e., 0, 1, or 2, with the value of the clinical measurement), ANOVA analysis
(correlating a continuous clinical outcome value with the presence of the haplotype, relative to the outcome value of
individuals lacking the haplotype), or case-control chi-square analysis (correlating a binary clinical outcome value with
the presence of the haplotype, relative to the outcome value of individuals lacking the haplotype).
[0170] In one embodiment, a "tight cut-off" criterion is next applied to each haplotype in turn. A first haplotype is
selected (S19) and its correlation with the clinical outcome value is tested against a tight cut-off (S20). A typical value
for the tight cut-off will be in the range p = .01 to .05, although other values may be chosen on empirical or theoretical
grounds. If the haplotype correlation meets the tight cut-off it is displayed to the user of the system (S21) (or, alternatively,
stored for later display), and stored for later combination (S22). If the haplotype correlation does not meet the
tight cut-off it is tested against a "loose cut-off" (S23), typically in the range p = .05 to 0.1. Again, other cut-off values
may be chosen if desired for any reason. A haplotype meeting the loose cut-off is stored for later combination (S22).
Any haplotype whose correlation does not meet either cut-off is discarded (S24), i.e., it is not considered further in the
process. If there are haplotypes remaining to be tested against the cut-offs (S25) they are selected (S26) and tested
(S20) in turn.

[0171] In an alternative embodiment, a tight cut-off is not applied. The correlation of each haplotype is tested directly
against the loose cut-off, and the haplotype is either saved or discarded. In this embodiment, correlations of sub-haplotypes
generated by masking (see below) are also tested directly against the loose cut-off. If desired, sub-haplotypes
which are saved at the end of this alternative process may be measured against a tight cut-off, and those that pass
may be displayed.

[0172] When all haplotypes have had their correlations tested, the next step of the process consists of generating
all possible sub-haplotypes in which a single SNP is masked, i.e., its identity is disregarded. If novel (i.e., untested)
sub-haplotypes are possible (S27), which will be the case on the first iteration, they are generated by systematically
masking each SNP of all saved haplotypes (S28). The correlations of the newly generated sub-haplotypes with the
clinical outcome value are calculated (S29), as was done for the haplotypes themselves. A first sub-haplotype is selected
(S30) and its correlation is tested against the tight and loose cut-offs (S20, S23) as described above for the haplotype
correlations. Each sub-haplotype is tested in turn, as described above, discarding any sub-haplotypes that do not pass
the cut-off criteria and saving those that do pass.
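The masking step of the pare-down process can be sketched as follows. The sketch assumes haplotypes are strings of nucleotides and that a masked site is written as `*`, following the sub-haplotype notation used for Fig. 39. The function names are hypothetical.

```python
def mask_one_snp(haplotype, wildcard="*"):
    """All sub-haplotypes obtained by masking exactly one not-yet-masked site."""
    return {
        haplotype[:i] + wildcard + haplotype[i + 1:]
        for i, allele in enumerate(haplotype)
        if allele != wildcard
    }


def matches(sub_haplotype, haplotype, wildcard="*"):
    """True if the haplotype agrees with the sub-haplotype at every unmasked site."""
    return all(s == wildcard or s == h for s, h in zip(sub_haplotype, haplotype))


first_pass = mask_one_snp("ACG")
second_pass = mask_one_snp("A*G")
```

Applying `mask_one_snp` repeatedly to the saved sub-haplotypes pares them down until only single SNPs remain, at which point no further sub-haplotypes can be generated.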

[0173] Optionally in a preferred embodiment, complex redundant haplotypes and sub-haplotypes are discarded after
correlations are calculated for the sub-haplotypes and SNPs generated by the masking step (S31). Complex redundant
haplotypes and sub-haplotypes are those which are constructed from smaller sub-haplotypes or SNPs, where the smaller
sub-haplotypes or SNPs have correlation values that are at least as significant as that of the complex sub-haplotype,
i.e., they have correlation values that account for the correlation value of the complex redundant sub-haplotype. In such
cases the complex haplotype or sub-haplotype provides no additional information beyond what its component
sub-haplotypes or SNPs provide, which makes it redundant.

[0174] When all sub-haplotypes have been examined, the process generates new sub-haplotypes by masking SNPs
among the newly saved sub-haplotypes. The process is preferably iterated until no new sub-haplotypes are being
generated; this may occur only when the sub-haplotypes have been reduced to individual SNPs. Alternatively the
practitioner may interrupt the process at any time.

[0175] The non-redundant sub-haplotypes and SNPs that remain are those that have the strongest association with 
the clinical outcome values. These are saved for future use (S32). 


E. TOOLS OF THE INVENTION 

[0176] The methods of the invention preferably use a tool called the DecoGen™ Application.
[0177] The tool consists of:

a. One or more databases that contain (1) haplotypes for a gene (or other loci) for many individuals (i.e., people
for the CTS™ method application, but it would include animals, plants, etc., for other applications) for one or more
genes and (2) a list of phenotypic measurements or outcomes that can be but are not limited to: disease measurements,
drug response measurements, plant yields, plant disease resistance, plant drought resistance, plant
interaction with pest-management strategies, etc. The databases could include information generated either
internally or externally (e.g., GenBank).

b. A set of computer programs that analyze and display the relationships between the haplotypes for an individual 
and its phenotypic characteristics (including drug responses). 

[0178] Specific aspects of the tool which are novel include:

a. A method of displaying measurements (such as quantitative phenotypic responses) for groups of individuals
with the same group of haplotypes or sub-haplotypes, and thereby easily showing how responses segregate by
haplotype or sub-haplotype composition. In the example herein, the display shows a matrix where the rows are
labeled by one haplotype and the columns by a second. Each cell of the matrix is labeled either by numbers, by
colors representing numbers, by a graph representing a distribution of values for the group or by other graphical
controls that allow for further data mining for that group.

b. A minimal spanning tree display (see, e.g., Ref. 8) showing the phylogenetic distance between haplotypes. Each
node, which represents a haplotype, is labeled by a graphic that shows statistics about the haplotype (for example,
fraction of the population, contribution to disease susceptibility).

c. Numerical modeling tools that produce a quantitative model linking the haplotype structure with any specific
phenotypic outcome, which is preferably quantitative or categorical. Examples of outcomes include years of survival
after treatment with anticancer drugs and increase in lung capacity after taking an asthma medication. This model
can use a genetic algorithm or other suitable optimization algorithm to find the most predictive models. This can
be extended to multiple genes using the current method (see Equation 5). Techniques such as Factor Analysis
(Ref. 4, Chapter 14) could be used to find the minimal set of predictive haplotypes.

d. A genotype-to-haplotype method that allows the user to find the smallest number of sites to genotype in order
to infer an individual's haplotypes or sub-haplotypes for a given gene. An individual's haplotypes provide unambiguous
knowledge of his genetic makeup and hence of the protein variations that person possesses. As described
earlier, the individual's genotype does not distinguish his haplotypes so there is ambiguity about what protein
variants the individual will express. However, using current technology, it is much more expensive to directly haplotype
an individual than it is to genotype him. The method described above allows one to predict an individual's
haplotypes, and therefore to make use of the predictive haplotype-to-response correlation derived from a clinical
trial. The steps required for this to work are (a) determine the haplotype frequencies from the reference population
directly; (b) correct the observed frequencies to conform to Hardy-Weinberg equilibrium (unless it is determined
that the deviation is not due to sampling bias as discussed above); and (c) use the statistical approach described
in the third paragraph of item 6 above to predict individuals' haplotypes or sub-haplotypes from their genotypes.
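The minimal spanning tree of item b can be sketched with Prim's algorithm. The sketch assumes that the distance between two haplotypes is the number of sites at which they differ (Hamming distance). The source does not specify the distance measure, so this choice and the function names are assumptions. The haplotypes are the five from the three-site example given earlier.

```python
def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def minimal_spanning_tree(haplotypes):
    """Prim's algorithm; returns edges as (haplotype_in_tree, added_haplotype, distance)."""
    remaining = list(haplotypes)
    in_tree = [remaining.pop(0)]
    edges = []
    while remaining:
        # Cheapest edge from the tree to a haplotype not yet in it.
        distance, src, dst = min(
            (hamming(s, t), s, t) for s in in_tree for t in remaining
        )
        edges.append((src, dst, distance))
        in_tree.append(dst)
        remaining.remove(dst)
    return edges


tree = minimal_spanning_tree(["000", "110", "100", "111", "101"])
```

Each node of the resulting tree could then be annotated with statistics such as the haplotype's population fraction.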

F. DATA/DATABASE MODEL

[0179] The present invention uses a relational database which provides a robust, scalable and releasable data storage
and data management mechanism. The computing hardware and software platforms, with 7x24 teams of database
administration and development support, provide the relational database with advantageous guaranteed data quality,
data security, and data availability. The database models of the present invention provide tables and their relationships
optimized for efficiently storing and searching genomic and clinical information, and otherwise utilizing a genomics-oriented
database.

[0180] A data model (or database model) describes the data fields one wishes to store and the relationships between
those data fields. The model is a blueprint for the actual way that data is stored, but is generic enough that it is not
restricted to a particular database implementation (e.g., Sybase or Oracle). In the preferred embodiment of the present
invention, the model stores the data required by the DecoGen application.

1. Database Model Version 1 



[0181] In one embodiment, the database comprises 5 submodels which contain logically related subsets of the data. 
These are described below. 

1. Gene Repository (Fig. 25A): This submodel describes the gene loci and their related domains. It captures the
information on gene, gene structure, species, gene map, gene family, therapeutic applications of genes, gene
naming conventions and publication literature including the patent information on these objects.

2. Population Repository (Fig. 25B): This submodel encapsulates the patient and population information. It covers
entities such as patient, ethnic and geographical background of patient and population, medical conditions of
the patients, family and pedigree information of the patients, patient haplotype and polymorphism information and
their clinical trial outcomes.

3. Polymorphism Repository (Fig. 25C): This submodel stores the haplotypes and the polymorphisms associated
with genes and patient cohorts used in clinical trials. The polymorphisms may include SNPs, small insertions/deletions,
large insertions/deletions, repeats, frame shifts and alternative splicing.

4. Sequence Repository (Fig. 25D): Genetic sequence information in the form of genomic DNA, cDNA, mRNA
and protein is captured by this data submodel. What is more important in this model is the location relationship
between the gene structural features and the sequences. Patent information on sequences is also covered.

5. Assay Repository (Fig. 25E): This submodel captures client companies, contact information, compounds used
in the different disease areas and assay results for such compounds in regards to polymorphisms and haplotypes
in target genes.

[0182] A model or sub-model is a collection of database tables. A table is described by its columns, where there is
one column for each data field. For instance, the table COMPANY contains the following 3 columns: COMPANY_ID,
COMPANY_NAME, and DESCR. COMPANY_ID is a unique number (1, 2, 3, etc.) assigned to the company.
COMPANY_NAME holds the name (e.g., "Genaissance") and DESCR holds extra descriptive information about the
company (e.g., "The HAP Company"). There will be one row in this table for each company for which data exists in
the database. In this case COMPANY_ID is the "primary key", which requires that no two companies have the same
value of COMPANY_ID, i.e., that it is unique in the table. Tables are connected together by "relationships". To understand
this, refer to Figure 25E, which shows the table COMPANYADDRESS. It has fields COMPANY_ID, STREET,
CITY, etc. In this table the field COMPANY_ID refers back to the table COMPANY. If a company has several locations,
there will be several rows in the table COMPANYADDRESS, each with the same value of COMPANY_ID. For each of
these we can get the name and description of the company by referring back to the COMPANY table.
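The primary key and relationship just described can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely for illustration; the source mentions Sybase and Oracle, not SQLite. The column types and the address rows are assumptions, while the table names, column names and the company row come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript(
    """
    CREATE TABLE COMPANY (
        COMPANY_ID   INTEGER PRIMARY KEY,   -- unique per company
        COMPANY_NAME TEXT NOT NULL,
        DESCR        TEXT
    );
    CREATE TABLE COMPANYADDRESS (
        COMPANY_ID INTEGER NOT NULL REFERENCES COMPANY (COMPANY_ID),
        STREET     TEXT,
        CITY       TEXT
    );
    """
)
conn.execute("INSERT INTO COMPANY VALUES (1, 'Genaissance', 'The HAP Company')")
# Hypothetical addresses: one company with two locations.
conn.executemany(
    "INSERT INTO COMPANYADDRESS VALUES (?, ?, ?)",
    [(1, "1 Example Street", "Example City"), (1, "2 Sample Avenue", "Sample Town")],
)
# Refer back from each address row to the company's name.
rows = conn.execute(
    """
    SELECT c.COMPANY_NAME, a.CITY
    FROM COMPANYADDRESS a
    JOIN COMPANY c ON c.COMPANY_ID = a.COMPANY_ID
    ORDER BY a.CITY
    """
).fetchall()
```

Both address rows carry the same COMPANY_ID, so the join returns the company name once per location.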

b. Abbreviations 

[0183] The following abbreviations are used in FIGURES 25A-E and the tables describing the database model depicted
therein:

AA: amino acid
Clin: clinical
Descr: description
FK: foreign key
Geo: geographical
Hap: haplotype
ID: identifier
Loc: location
Mol: molecule
NT: nucleotide
PK: primary key
Poly: polymorphism
Pos: position
Pub: publication
QC: quality control
Seq: sequence
SNP: single nucleotide polymorphism
Therap: therapeutic


c. Tables 





[0184] In this embodiment of the present invention, the database contains 76 tables as follows: 





1) Accession
2) Assay
3) AssayResult
4) BioSequence
5) ChromosomeMap
6) ClasperClone
7) ClinicalSite
8) Company
9) CompanyAddress
10) Compound
11) CompoundAssay
12) Contact
13) FamilyMember
14) FamilyMemberEthnicity
15) Feature
16) FeatureAccession
17) FeatureGeneLocation
18) FeatureInfo
19) FeatureKey
20) FeatureList
21) FeaturePub
22) Gene
23) GeneAccession
24) GeneAlias
25) GeneFamily
26) GeneMapLocation
27) GenePathway
28) GenePriority
29) GenePub
30) GenotypeCode
31) Ethnicity
32) HapAssay
33) HapCompoundAssay
34) HapHistory
35) Haplotype
36) HapMethod
37) HapPatent
38) HapPub
39) HapSNP
40) HapSNPHistory
41) LocationType
42) MapType
43) Method
44) MoleculeType
45) Nomenclature
46) Patent
47) PatentImage
48) Pathway
49) PathwayPub
50) PolyMethod
51) Polymorphism
52) PolyNameAlias
53) PolySeq3
54) PolySeq5
55) Publication
56) SeqAccession
57) SeqFeatureLocation
58) SeqGeneLocation
59) SeqSeqLocation
60) SequenceText
61) SNPAssay
62) SNPPatent
63) SNPPub
64) Species
65) Patient
66) PatientCohort
67) PatientEthnicity
68) PatientHap
69) PatientHapClinOutcome
70) PatientHapHistory
71) PatientMedicalHistory
72) PatientSNP
73) PatientSNPHistory
74) TherapeuticArea
75) TherapeuticGene
76) VariationType

[0185] Additional tables (not shown) may include Allele, FeatureMapLocation, PubImage, TherapCompound.



[0186] Figures 25A-E show the fields of each table in the database. The following are descriptions of the fields found
in the database as well as for fields and tables that could be added to the database:



Name              Null?     Type            Description
ACCESSION         NOT NULL  VARCHAR2(20)    a unique ID for a sequence in the commonly used public
                                            domain databases; becomes de facto standard for
                                            sequence data access in academia and industry
SOURCE                      VARCHAR2(20)    who issued the ID
DESCR                       VARCHAR2(200)   other descriptions
INSERTED_BY                 VARCHAR2(30)    who inserted the record
INSERT_TIME                 DATE            when
UPDATED_BY                  VARCHAR2(30)    who updated the record
UPDATE_TIME                 DATE



Name              Null?     Type            Description
ALLELE_NAME       NOT NULL  NUMBER(4)       allele is the one member of a pair or series of genes
                                            that occupy a specific position on a specific chromosome
POLY_ID           NOT NULL  NUMBER          foreign key to the polymorphism record
NT_SEQ_TEXT                 VARCHAR2(4000)  nucleotide sequence
AA_SEQ_TEXT                 VARCHAR2(1000)  amino acid sequence string
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE
Name              Null?     Type            Description
ASSAY_ID                                    primary key for the assay table
ASSAY_NAME                  VARCHAR2(50)
ASSAY_PARAMETERS            VARCHAR2(200)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
ASSAY_ID          NOT NULL  NUMBER
ASSAY_TYPE                  VARCHAR2(100)
MEASURE                     VARCHAR2(200)   measurement of the assay parameters
TIMESTAMP                   DATE            time of operation
OPERATOR                    VARCHAR2(50)    who did it
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
SEQ_ID                      NUMBER          sequence ID (PK)
MOL_TYPE                    VARCHAR2(20)    molecular type
SEQ_LENGTH                  NUMBER          sequence length
PATENT_ID                   NUMBER          FK to the patent record
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
MAP_ID            NOT NULL  NUMBER(4)       unique genetic map ID
MAP_TYPE_ID       NOT NULL  NUMBER(4)       FK to MapType
SPECIES_ID        NOT NULL  NUMBER          FK to species
CHROMOSOME                  VARCHAR2(2)
MAP_NAME                    VARCHAR2(50)
EXTERNAL_KEY                VARCHAR2(50)    ID used by external sources
KEY_SOURCE                  VARCHAR2(20)    which source
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Table: ClasperClone

Name              Null?     Type            Description
CLASPER_CLONE_ID  NOT NULL  NUMBER          unique ID for each ClasperClone
PI                          VARCHAR2(50)    subject ID: it is the FK to the Subject table
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
CLINICAL_SITE_ID  NOT NULL  NUMBER(4)
SITE_NAME                   VARCHAR2(50)
COMPANY_ID                  NUMBER
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
COMPANY_ID                  NUMBER
COMPANY_NAME                VARCHAR2(50)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
COMPANY_ID                  NUMBER
CONTACT_ID                  NUMBER
STREET                      VARCHAR2(50)
CITY                        VARCHAR2(50)
STATE                       VARCHAR2(50)
COUNTRY                     VARCHAR2(100)
ZIP                         VARCHAR2(20)
WEB_SITE                    VARCHAR2(200)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
COMPOUND_ID       NOT NULL  NUMBER
COMPANY_ID                  NUMBER
THERAP_ID                   NUMBER
PATENT_ID                   NUMBER
REGISTRATION_NUM                            Compound registration number is generally the
                                            unique ID for the
COMPOUND_NAME               VARCHAR2(200)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Table: CompoundAssay

Name              Null?     Type            Description
COMPOUND_ID       NOT NULL  NUMBER
ASSAY_ID          NOT NULL  NUMBER
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Name              Null?     Type            Description
CONTACT_ID                  NUMBER
COMPANY_ID                  NUMBER
ADDRESS_ID                  NUMBER
LAST_NAME                   VARCHAR2(50)
MIDDLE_NAME                 VARCHAR2(20)
FIRST_NAME                  VARCHAR2(50)
OFFICE_PHONE                VARCHAR2(20)
EMAIL                       VARCHAR2(100)
CELL_PHONE                  VARCHAR2(20)
PAGER_PHONE                 VARCHAR2(20)
FAX                         VARCHAR2(20)
WEB_SITE                    VARCHAR2(200)
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE
Table: FamilyMember

Name              Null?     Type            Description
PI                NOT NULL  VARCHAR2(50)    FK to Patient
FAMILY_POSITION   NOT NULL  VARCHAR2(20)    examples are siblings, parents, grandparents
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



Table: FamilyMemberEthnicity

Name              Null?     Type            Description
PI                NOT NULL  VARCHAR2(50)
FAMILY_POSITION   NOT NULL  VARCHAR2(20)
ETHNIC_CODE       NOT NULL  VARCHAR2(20)    FK pointing to the Ethnicity table
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE



table 

Feature 

Accession 



Null? Type 
NOT NULL NUMBER 



FEATURE.NAME VARCHAR2(50) 
FEATURE.KEYJD NOT NULL NUMBER(3) 



MAPJD 

DESCR 

INSERTED.BY 

INSERT_TIME 

UPDATED.BY 

UPDATE_TIME 



ACCESSION 
FEATUREJD 



NUMBER 
VARCHAR2(200) 
VARCHAR2(30} 
DATE 

VARCHAR2(30) 
DATE 

Type 



a feature is defined as 
either a genomic 
stnjcture of a gene, or a 
fragment of DMA on a 
chromosome in the 
genome. 

FK pointing to the Gene 
table in case of feature 
of a gene 

FK polntng to the 
FeatureKey table to 
allow only validated 
feature types 



NOT NULL 
NOT NULL 



VARCHAR2(20) 
NUMBER 
table

Feature 

GeneLocation 



END_POS 
DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 

Name 



NUMBER 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 

Type 



the start position of the 
feature in the sequence 
identified by that 
accession 
the end position 



GENEJD 
LOC.TYPE 



FeatureKey 



FEATUREJD 
LOC_VALUE 


NOT NULL 


NUMBER 
NUMBER 


RANGE_FROM 




NUMBER 


RANGE_TO 




NUMBER 


DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 




VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 


Name 


Null? 


Type 


FEATUREJD 
QUALIFIER 


NOT NULL 
NOT NULL 


NUMBER 
VARCHAR2(50) 


DETAIL_VALUE 




VARCHAR2{2000) 


DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 




VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 


Name 


Null? 


Type 



FEATURE_KEYJD NOT NULL NUMBER(3) 
FEATURE_KEY VARCHAR2(20) 

SOURCE VARCHAR2(20) 



FK 

location type determines 
what type of structural 
relationship we are going 
to build in the particular 
case between the gene 
and the feature 
FK 

if the location type 

requires only one value, . 

here it goes 

if the location type is a 

range, then this is the 

start position 

and this is the end 



a fiee set of annotations 
to a feature 

the values of the qualifier 
annotation 



feature key validates the . 
feature types allowed 
who defined the key 






EP 1 233 365 A2 



table 

FeatureList 




table 

Feature Map 
Location 




table 
Gene 



INSERTED_BY 
INSERTTIME 
UPDATED_BY 
UPDATEJTIME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



FEATUREJD 
ITEMJD 



DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE.TIME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



PK1 

PK2.Thlsst 
usediobirildthe 
relationship between 2 
features 



FEATUREJD NOT NULL NUMBER 
MAPJO NOT NULL NUMBER(4) 

MAP_LOCATION NUMBER 



DESCR 

INSERTED_BY 
INSERT_TIME 
UPOATED_BY 
UPDATE TIME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



PUBJD 
FEATUREJD 



DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE.flME 



NOT NULL NUMBER 
NOT NULL NUMBER 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



publication ID is the PK 
&FK 

SO is the feature ID. This 
table builds the many-to- 
many relationship 
between the tables of 
Publication and Feature 



Name 



Type 



NOT NULL NUMBER 



unique ID for a gene 
GENE_SYMBOL NOT NULL

GENE_FAMILYJD NUMBER 

SPECIESJD NOT NULL 

PATENTJD 

DESCR 
INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATEjriME 



NUMBER 

NUMBER 

VARCHAR2(200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE 



standardized gene 
symbols used in the 
most simplistic manner 
to refer to a gene 
the family cluster a gene 
belongs to 

the spedes which has 
this gene 

the patent assodaled 
with this gene 



Type 



GENEJD NOT NULL NUMBER 

ACCEisiON NOT NULL VARCHAR2(20) 

DESCR VARCHAR2(200) 

INSERTED_BY VARCHAR2(30) 

INSERT_TIME DATE 

UPDATED_BY VARCHAR2(30) 

UPDATE_flME DATE 

Name Null? Type 

GENEJD NOT NULL NUMBER 

ALIAS.NAME NOT NULL VARCHAR2(500) 



gene and the sequence 
association-through the 
unique accession 



DESCR 

1NSERTED_BY 

INSERT.TIME 

UPDATED_BY 

UPDATE.flME 



V/U^CHAR2[200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



GENE_FAMILYJD NOT NULL 

FAMILY_NAME 

DESCR 

INSERTED_BY . 
INSERT_TIME 
UPDATED_BY 
UPDATE.TIME 



NUMBER{4) 
VARCHAR2(50) 
VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 
table

GeneMap 

Location 



GENEJD 
IWAPJD 

MAP_LOCATION 

DESCR 

INSERTED_BY 

INSERTjriME 

UPDATED_BY 

UPDATE_TIME 



NOT NULL NUMBER 
NOT NULL NUMBER(4) 

NUMBER 
VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



genome map location 



Null? 



Type 



PATHWAYJD NOT NULL NUMBER(4) 



GENEJD 
DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 



NUMBER 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the biological pathway in 
which the gene plays a 



Type 



GENEJD NOT NULL NUMBER 

TASK_FORCE_NUM NUMBER(6) 

REX_PRIORITY VARCHAR2(5) 

NEW.PRIORITY VARCHAR2(5) 

REALM_PRIORrrY VARCHAR2{5) 

DESCR VARCHAR2(200) 

INSERTED.BY VARCHAR2(30) 

INSERT_TIME DATE 

UPDATED_BY VARCHAR2(30) 

UPDATE_TIME DATE 



internal info for gene 
project prioritization 



Null? 



Type 



PUBJD 

GENEJD 

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE J-IME 



NOT NULL NUMBER 

NOT NULL NUMBER 

VARCHAR2(200) 
VARGHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



publications concerning 
a gene 
GENOTYPE

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE_flME 



NOT NULL CHAR(1) 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



genotyping code for the 
polymorphism 



ETHNIC_GROUP 



ETHNIC_CODE NOT NULL VARCHAR2(20) 



ETHNIC_NAME 

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE_flME 



VARCHAR2(100) 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the major ethnic groups 
such as Caucasian. 
Asian, eta 
the Ethnic code that 
specifies the detafled 
geographical and ethnic 
trackgroundofthe 
subject (patient, or 
genetic sample donor) 
the name description of 



HapAssay 



Name 



HAPJD 

ASSAYJD 

DESCR 

INSERTED_BY 

INSERT.TIME 

UPDATED_BY 

UPDATE_flME 



Type 



NOT NULL NUMBER 

NOT NULL NUMBER 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2{30) 
DATE 



HapCompound 
^say 



NOT NULL NUMBER 



COjy^POUNDJD NOT NULL NUMBER 
ASSAYJD NOT NULL NUMBER 

VARCHAR2(200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 



DESCR 
INSERTED BY 
INSERT.TIME 
UPDATED_BY 



the haplotype of a gene 
and a compound meet in 
a specific assay 
UPDATE_TIME

Name Null? 



DATE 
Type 



HAP.HISTORYJD NOT NULL NUMBER 



HAPJD 
GENEJD 

CREATE_TIMESTAMP 
HAP_NAME 

history_timestamp 

original_descr 

history_descr 

inserte5_by 

insert_time 

updateo.by 

UPDATE_flME 



NUMBER 
NUMBER 
DATE 

VARCHAR2(50) 

DATE 
VARCHAR2(200) 
VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



history table to keep 
track of the knowledge 
progress concerning a 
haplotype 



when created 
when put Into history 



Nun? 



Type 



HAPJD 

GENEJD 

TIMESTAMP 

HAP_NAME 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATEO.BY 
UPDATE_TIME 



NUMBER 
NUMBER 
DATE 

VARCHAR2{50} 
VARCHAR2(200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE 



■Type 



HAPJD 
METHODJD 

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.TIME 



NOT NULL 
NOT NULL 



NUMBER 
NUMBER 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



method used in 
haplotyping 



Name 



Type 



HAPJD 

patEntjd 

DESCR 

INSERTED_BY 
INSERT_TIME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 



patent relates to 
haplotype 
UPDATED_BY
UPDATE_TIME



VARCHAR2(30) 
DATE 



HAPJD 

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.flME 



NOT NULL NUMBER 

NOT NULL NUMBER 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



publication relates to a 



table 
HapSNP 



Name 



HAPJD 
POLYJD 

TIMESTAMP 

DESCR 

tNSERTED_BY 

INSERT_TIME 

UPOATED_BY 

UPDATE_flME 



NOT NULL 
NOT NULL 



Type 

NUMBER 
NUMBER 

DATE 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



HAP_SNP_HISTORYJD NOT NULL NUMBER(4) 



HAPJD NOT NULL 

POLYJD NOT NULL 

CREATE_TIMESTAMP 
HISTORY_TIMESTAMP 
original! DESCR 
HISTORY_DESCR 
INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 



NUMBER 
NUMBER 
DATE 
DATE 
VARCHAR2{200) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



history about the 
progress of the SNPs 
that are used In a 



DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 



VARCHAR2(20O) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 



location type for the 
various genetic objects 
in the genome 
UPDATE_TIME
Name 



DATE 
Type 



MAP_TYPE_ID NOT NULL NUMBER(4) 



MAP.TYPE 

DESCR 

INSERTED_BY 

INSERT.TIME 

UPDATED.BY 

UPDATE_flME 



VARCHAR2(20) 
VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2{30) 
DATE 



validation tool for the 
possible types of 
genome maps 



Name 



Type 



METHODJD 
METHOD 

PROTOCOL 

DESCR 

INSERTEO_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.flME 



table 

MoleculeType 



MOL_TYPE 

DESCR 

INSERTED.BY 

INSERT.TIME 

UPDATED_BY 

UPDATE.flME 



NOT NULL NUMBER 

NOT NULL VARCHAR2(50) the lab experimental 
method 

VARCHAR2(2000) the detailed protocol for 

a method 

VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 

Null? Type 



NOT NULL VARCHAR2{20) fholecular type for which 
a sequence is known 

VARCHAR2(200) 
. VARCHAR2<30) 
DATE 

VARCHAR2(30) 
DATE 



Name 



Type 



6ENE_SYMB0L NOT NULL 
6ENE_NAME 



SOURCE 

CYTO_L0CATI0N 



VARCHAR2(20) 
VARCHAR2(500) 



VARCHAR2(20) 
VARCHAR2(50) 



used to standardize the 
naming of a gene. 
HUGO official name 
takes precedence in the 
naming scheme . 

cyfogenetk; location of a 
gene; this is the best 
way to map various gene 
names onto a single 
gene 

ID by other public data 
DESCR

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE.flME 



VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Name 



Type 



PATENTJD 
PATEhfTTYPE 

COMPANYJD 

INVENTORS 

ABSTRACT 

INSTITUTION 

CLAIMS 

TITLE 

DESCR 

INSERTED_BY 
INSERT_TIME . 
UPDATED_BY 
UPDATEjriME 



NOT NULL NUMBER 

VARCHAR2(20) 

NUMBER 
VARCHAR2(200) 
VARCHAR2(1000) 
VARCHAR2(200) 
VARCHAR2(4000) 

VARCHAR2C200) 
VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



patent type can be 
issued, pending, etc. 



the claims of the patent 



Nua? 



Type 



PATENT_ID 
PDFFILE 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED.BY 
UPDATE.fliWIE 



NUMBER 
BLOB 

VARCHAR2(20) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the multi-media image 
file of the patent 



Name 



Type 



PATHWAYJD NOT NULL 

PATHWAY_NAME 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_TIME 



NUMBER(4) 

VARCHAR2(50) 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2{30) 
DATE 



Name 



Null? 



Type 



PATHWAYJD 
PUBJD 

DESCR 

INSERTED_BY 



NOT NULL 
NOT NULL 




publications concerning 
a pathway 
INSERT_TIME
UPDATED_BY
UPDATE_TIME



DATE 

VARCHAR2(30) 
DATE 



method used in 
discover^g a 



POLYJD 

METHODJD 

DESCR 

INSERTED_BY 
INSERTjriME 
UPDATED_BY 
UPDATEjriME 



NOT NULL 
NOT NULL 



NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30} 
DATE 



Type 



POLYJD 
FEATURE ID 



NOT NULL 
NOT NULL 



VARIATIONJTYPE NOT NULL VARCHAR2(3) 
POLY_CONSEQUENCE VARCHAR2(200) 



SYSTEM_NAME 
START_POS 



END_POS 
LENGTH 



QC 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_flME 



VARCHAR2(50) 
NUMBER 



NUMBER 
NUMBER 



VARCHAR2(20) 

VARCHAR2(200) 
VARCHAR2<30) 
DATE 

VARCHAR2(30) 
DATE 



PK for a polymorphism 

where the polymorphism 

occurs in a genetic 

feature 

what type of 

polymorphism 

the consequence or 

mechanism of 0\6 

polynwrphism 

the systematic name for 

the polymorphism 

starting position of the 

polymorphism in the 

feature 

ending position 
length of the changing 
stnicture 

FK to a table in another 
in-house database 
where the primers used 
in the polymorphism 
discovery was kept 
the number of subject 
being used in the 
discovery of the 
polymorphism 
quality control 



table Name 
PolyNameAllas 



Type 



NOT NULL NUMBER 
NAME_ALIAS

EXTERNAL_KEV 

KEY_SOURCE 
DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE.flME 



VARCHAR2(50) 

VARCHAR2(20) 
VARCHAR2{200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE 



other names for the 

polymorphism 

unique ID by other data 



POLYJD 
SEQ.TEXT 

OESCR 

INSERTED_BY 
INSERTjriME 
UPDATED_BY 
UPDATE.flME 



NOT NULL 
NOT NULL 



Type 



NUMBER 
VARCHAR2(250) 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the 3* DNA sequence 
that flanks the 
potymorphlcsite 



POLYJD 

SEQ_TEXT 

DESCR 

INSERTED_BY 
INSERTjriME 
UPDATED_BY 
UPDATE.flME 



NOT NULL 
NOT NULL 



Type 



NUMBER 

VARCHAR2(250) 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



the 5' DNA sequence 
that flanks the 
polymorphic site 



Type 



PUBJD 
PDFFILE 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATEO_BY 
UPDATE_TIME 



NUMBER 
BLOB 

VARCHW«(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



PUBJD 

AUTHORS 

TITLE 

INSTITUTION 
SOURCE 



NUMBER 

VARCHAR2(200) 

VARCHAR2(500) 

VARCHAR2(200) 

VARCHAR2(200) 
KEYWORDS

ABSTRACT 

EXTERNAL_KEY 

KEV.SOURCE 

DESCR 

INSERTED.BY 
INSERTTIME 
UPDATED.BY 
UPDATE_flME 



VARCHAR2(500) 
VARCHAR2(40OO) 
VARCHAR2(50) 
VARCHAR2(20) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(3p) 
DATE 



table 

SeqAccession 



SEQJD 

AccissiON 

VERSION 
Gl 

DESCR 

INSERTED_BY 
INSERTJTIME 
UPDATEO_BY 
UPDATE_TIME 



NOT NULL 
NOT NULL 



NUMBER 
VARCHAR2(20) 

NUMBER 
NUMBER 

VARCHAR2(200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE , 



PK for sequence 
unique ID from the public 
sequeiKX databases 
version of the sequence 
gene ID issues by NCBt 
national database 



table 

SeqFeature 
Location 



table 
SeqGene 



LOC_TYPE 

SEQJD 

FEATUREJD 

LOC.VALUE 

RANGE_FROM 

RANGE_TO 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_flME 



NOT NULL 
NOT NULL 
NOT NULL 



VARCHAR2(20) 
NUMBER 
NUMBER 
NUMBER 
NUMBER 
. NUMBER 
VARCHAR2{200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 

Type 



sequence and feature 
location relationship 



sequence and gene 
location relalionship 



GENEJD 

LOC.TYPE 

SEQJD 

LOC_VALUE 

RANGE_FROM 

RANGE_TO 

DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 



NUMBER. 

VARCHAR2(20) 

NUMBER • 

NUMBER 

NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
UPDATE_TIME



LOC_TYPE 

SEQJD 

ITEMJD 

locJtalue 
range_from 

RANGE_TO 
OESCR 

INSERTED.BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_flME 



VARCHAR2{20) 

NUMBER 

NUMBER 

NUMBER 

NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



Table: SequenceText (the actual sequence text in a string of characters)

Name              Null?     Type            Description
SEQ_ID            NOT NULL  NUMBER
SMALL_SEQ_TEXT              VARCHAR2(4000)  if the sequence is less than 4000 characters, it is
                                            stored in this field
LARGE_SEQ_TEXT                              if larger than 4K, stored as a LONG datatype in this
                                            field, which has much limitation in terms of
                                            processing capacities by the DBMS. This division is
                                            caused by the fact that an Oracle VARCHAR2 data type
                                            can store only 4000 characters.
DESCR                       VARCHAR2(200)
INSERTED_BY                 VARCHAR2(30)
INSERT_TIME                 DATE
UPDATED_BY                  VARCHAR2(30)
UPDATE_TIME                 DATE
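
As a minimal sketch of the division just described, the choice between SMALL_SEQ_TEXT and LARGE_SEQ_TEXT might be expressed as below. The function name and the treatment of a sequence of exactly 4000 characters (here placed in the small field, since it still fits) are assumptions made for illustration and are not stated in the field descriptions.

```python
VARCHAR2_LIMIT = 4000  # character limit cited in the field description


def sequence_text_columns(seq_text):
    """Return the (SMALL_SEQ_TEXT, LARGE_SEQ_TEXT) values for one row.

    Exactly one of the two values is populated; the other is None (NULL).
    Hypothetical helper, not part of the described database.
    """
    if len(seq_text) <= VARCHAR2_LIMIT:
        return seq_text, None   # short sequence: fits the VARCHAR2 field
    return None, seq_text       # long sequence: goes to the LONG field


short_seq = "ACGT" * 10     # 40 characters
long_seq = "A" * 4001       # one character over the limit
small_row = sequence_text_columns(short_seq)
large_row = sequence_text_columns(long_seq)
```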



Type 



POLYJD 
ASSAYJD 
DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_flME 



NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



Type 



polymorphism related 



NOT NULL NUMBER 
PATENT_ID NOT NULL NUMBER



DESCR 

INSERTED_BY 
rNSERT_TIME 
UPDATED_BY 
UPDATE_TIME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



PUBJD 

POLYJD 

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.flME 



NOT NULL 
NOTNUa 



Type 



NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2{30) 
DATE 



a polymorphism related 
publications 



Type 



a biological species 



SPECIESJD 
SYSTEM_NAME 

CGMMON.NAME 

DESCR 

INSERTED_BY 

INSERT.TIME. 

UPDATED_BY 

UPDATE_TIME 



NUMBER 
VARCHAR2(50) 

VARCHAR2(20) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



its scientific systematic 
its common name 



Type 



CUNICAL_SITEJD NOT NULL 
PI NOT NULL 

GENDER 
YOB 

FAMILYJD 
FAMILY.POSmON 



EXTERNAL_KEY 

KEY_SOURCE 

DESCR 

INSERTED.BY 

INSERTjriME 

UPDATED_BY 

UPDATE_flME 




VARCHAR2(20) 

VARCHAR2(20) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



patient ID as the unique 
identifier for a person 

year of birth 
family ID if known 
the generation 
information in a family 
based genetic study 
ttie ID used by other 



NuH? Type 
NOT NULL NUMBER 



the patient set used in a 
particular project 
PI

DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE_TIME 



NOT NULL VARCHAR2(50) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Name 



PI NOT NULL 

ETHNIC_CODE NOT NULL 
DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED.BY 
UPDATE_flME 



Type 



VARCHAR2(50) 
VARCHAR2(20) 
VARCHAR2{200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE 



Ethnic background of a 
person 



HAPJD 
QC 

TIMESTAMP 

DESCR 

INSERTED_BY 

INSERTjriME 

UPDATED.BY 

UPDATE_TIME 



table 

PatientHapClin 
Outcome 



Null? 



NOT NULL 
NOT NULL 



SI NOT NULL 

HAPJD NOT NULL 

CLIN_TEST_NAME 
CLIN_TEST_RESULT 
DESCR 

INSERTED_BY 
INSERTJIME 
UPDATED_BY 
UPDATE_TIME 



Type 

VARCHAR2(50) 
NUMBER 
VARCHAR2(20) 
DATE 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 

Type 



VARCHAR2(50) 

NUMBER 

VARCHAR2(50) 

VARCHAR2{20) 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



Haplotyping information 
of a person 



the dinlcal measurement 
against a particular 
haplotype in a person 



able 

iubjectHap 
listory 



Name 



Null? 



Type 



S_HAP_HISTORYJD NOT NULL NUMBER 
HAPJD NUMBER 
QC VARCHAR2(20) 
SI VARCHAR2(50) 
CREATE_TIMESTAMP DATE 



history record of the 
haplotype information for 
a subject 
HISTORY_TIMESTAMP

0R1GINAL_DESCR 

HISTORY_DESCR 

INSERTED_BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.TIME 



DATE 

VARCHAR2(200) 
VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



SubjectMedical 
History 



NOT NULL 
NOT NULL 



DESCR 

INSERTED_BY 

INSERT_T1ME 

UPDATED_BY 

UPDATEjFlME 



VARCHAR2(50) 
NUMBER 



VARCHAR2{200) 
VARCHAR2{30) 
DATE 

VARCHAR2(30) 
DATE 



medical conditions of a 
subject when the geneb'c 
sample is a 



FK pointing toa 
therapeutic area which 
maps to a disease 



Type 



POLYJD 
GENOTYPE 



QC 

TIIVIESTAMP 

DESCR 

INSERTED.BY 

INSERT_TIME 

UPDATED_BY 

UPDATE.TIME 



NUMBER 

VARCHAR2(20) 
DATE 
. VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the genotyping 
infonnation of-a person 
at a given polymorphic 
site 

the polymorphism may 
tie a part of a haplotype 



Name 



Type 



history record for a 
polymorphism in a 



S_SNP_HISTORYJD NOT NULL NUMBER 



POLYJD 

HAPJD 

GENOTYPE 

CREATE_TIMESTAIVIP 

QC 

HISTORY_TIMESTAMP 
ORIGINAL_DESCR 
HISTORY_DESCR 
INSERTED_BY 



VARCHAR2(50) 

NUMBER 

NUMBER 

CHAR(1) 

DATE 

VARCHAR2(20) 
DATE 

VARCHAR2(200) 
VARCHAR2(200) 
VARCHAR2(30) 
INSERT_TIME
UPDATED_BY
UPDATE_TIME



VARCHAR2(30) 
DATE 



Therap 
Compound 



COMPOUNDJD 

THERAPJD 

DESCR 

INSERTED_BY 

INSERT.TIME 

UPDATED_BY 

UPDATE.flME 



NOT NULL 
NOT NULL 



NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



Type 



THERAP.AREA 

THERAPJD NOT NULL 

RELATED_AREA 

DESCR 

INSERTED_BY 

INSERTjriME 

UPDATED_BY 

UPDATE.TIME 



VARCHAR2(50) the disease name 
NUMBER 

NUMBER(4) is relation to other 

VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



Type 



the target gene for a 



GENEJD 

THERAPJD 

DESCR 

INSERTED.BY 

INSERT.TIME 

UPDATED_BY 

UPOATE.TIME 



NOT NULL 
NOT NULL 



NUMBER 

NUMBER 

VARCHAR2(200) 

VARCHAR2(30) 

DATE 

VARCHAR2(30) 
DATE 



Name 



Type 



VARIATION_TYPE NOT NULL VARGHAR2(3) 



DESCR 

INSERTED_BY 
INSERT_TIME 
UPDATED_BY 
UPDATE_flME 



VARCHAR2(200) 
VARCHAR2(30) 
DATE 

VARCHAR2(30) 
DATE 



the validated types of 



[0187] With reference to Figures 25A-E, and as is apparent to one of skill in the art, rectangular boxes represent
parent tables in the database, while rounded boxes represent child tables that depend on their parent tables. This
dependency requires that a parent record be in existence before a child record can be created. Within the tables the
primary keys are shown at the top and are partitioned off from the other fields by a line. Repeat instances of primary
keys are indicated by "(FK)", meaning foreign key.

[0188] FIG. 25F describes the relational symbols used in FIGS. 25A-E. A relational symbol such as indicated by
reference numeral 2 represents an identifying parent/child relationship. It depicts the not nullable 1-to-0-or-many
relationship. Not nullable means that one cannot create a record in the child unless a corresponding record (indicated
by the particular relating field) exists or is created in the parent. A relational symbol such as indicated by reference
numeral 4 represents a non-identifying parent/child relationship. It represents the nullable 0-or-1-to-many relationship.
A relational symbol such as indicated by reference numeral 6 represents an identifying parent/child relationship. It
depicts the not nullable 1-to-1-or-many relationship. A relational symbol such as indicated by reference numeral 8
represents a non-identifying parent/child relationship. It represents the not nullable 1-to-1-or-many relationship. A
relational symbol such as indicated by reference numeral 10 represents an identifying parent/child relationship. It
depicts the not nullable 1-to-exact-1 relationship. A relational symbol such as indicated by reference numeral 12
represents a non-identifying parent/child relationship. It represents the nullable 0-or-1-to-exact-1 relationship. A
relational symbol such as indicated by reference numeral 14 represents a non-identifying parent/child relationship. It
depicts the not nullable 0-or-1-to-many relationship.
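
The difference between a not nullable and a nullable parent/child relationship may be illustrated by the following sketch, again in Python with the standard sqlite3 module. The table names PARENT, CHILD_IDENTIFYING and CHILD_OPTIONAL are hypothetical. The sketch also assumes the common data-modelling reading of "identifying", namely that the parent key forms part of the child's primary key, which the text above does not spell out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE PARENT (PARENT_ID INTEGER PRIMARY KEY);

-- Identifying, not nullable: the parent key is part of the child's primary key.
CREATE TABLE CHILD_IDENTIFYING (
    PARENT_ID INTEGER NOT NULL REFERENCES PARENT (PARENT_ID),
    ITEM_NO   INTEGER NOT NULL,
    PRIMARY KEY (PARENT_ID, ITEM_NO)
);

-- Non-identifying, nullable: the parent key is an ordinary, optional column.
CREATE TABLE CHILD_OPTIONAL (
    CHILD_ID  INTEGER PRIMARY KEY,
    PARENT_ID INTEGER REFERENCES PARENT (PARENT_ID)
);
""")


def try_insert(sql, params):
    """Run one INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# No PARENT row 99 exists yet, so the not nullable child is refused.
orphan_identifying = try_insert("INSERT INTO CHILD_IDENTIFYING VALUES (?, ?)", (99, 1))
# A nullable relationship lets the child exist with no parent at all.
orphan_optional = try_insert("INSERT INTO CHILD_OPTIONAL VALUES (?, ?)", (1, None))
# Once the parent record exists, the child record can be created.
conn.execute("INSERT INTO PARENT VALUES (99)")
after_parent = try_insert("INSERT INTO CHILD_IDENTIFYING VALUES (?, ?)", (99, 1))
```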

2. Database Model Version 2 

[0189] A preferred embodiment of the database model of the invention contains 5 sub-models and 83 tables. This
model is organized at three levels of detail: sub-model, table and fields of tables.

a. Submodels 

[0190] The five submodels of this preferred embodiment are depicted in FIGURES 44A-E and are described below.

[0191] Genomic Repository (Fig. 44A): This submodel organizes genomic information by spatial relationships. The
central element of the genomic repository submodel is the Genetic_Feature object, which is an abstract template for
any object having a nucleotide sequence that can be mapped to the nucleotide sequence of other objects by providing
a start and stop position. Genetic objects (also referred to herein as genetic features) that are organized by the genomic
repository submodel include, but are not limited to, chromosomes, genomic regions, genes, gene regions, gene
transcripts and polymorphisms.

[0192] Some of these genetic objects contain nucleotide sequences identified in the public domain, while others
represent some derived final state of a calculation, as described below for generating an assembly and gene structure.
In object parlance, Genetic_Feature is the base class that these other objects extend. In relational
terms, the primary keys for each of these genetic objects are foreign keys to the primary key of the Genetic_Feature
table. Each genetic feature is represented by a unique Feature_ID that is generated by the database management
system's sequence generator. The principal properties of a genetic feature are start position, stop position and
reference. The start and stop positions indicate the extent of that genetic feature relative to another given genetic feature,
which is the reference and is represented by another unique Feature_ID generated by the database management
system's sequence generator. The reference serves as the parent in this table through the self-pointing foreign key Ref_ID.
The Feature_Type attribute allows the database model to determine what type of spatial relationship is
legal among what types of genetic features at a given time in a given context. For example, the system will allow a
gene to map onto a sequence assembly by defining the start and end position of the gene in the assembly. A gene
region is mapped onto a gene through a similar mechanism. The mapping of the gene region onto the assembly is
therefore made possible by traversing the links between the Seq_Assembly and Gene tables and between
the Gene and Gene_Region tables. Similarly, a polymorphism is mapped onto a sequence that will be a building block
for the assembly, which in turn determines the reference sequence for the gene being analyzed for genetic variation.
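
The parent-table arrangement described in paragraph [0192] can be illustrated with a small relational sketch. This is only a minimal illustration using Python's built-in sqlite3 module, not the schema of the described embodiment: the column names Feature_Name, Start_Pos and Stop_Pos are assumptions, an INTEGER PRIMARY KEY stands in for the database management system's sequence generator, and the Feature_ID of 1 for the assembly is assumed for the example.

```python
import sqlite3


def build_schema() -> sqlite3.Connection:
    """Create a toy Genetic_Feature parent table plus one subtype table (Gene)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript(
        """
        CREATE TABLE Genetic_Feature (
            Feature_ID   INTEGER PRIMARY KEY,  -- stands in for the sequence generator
            Feature_Name TEXT,                 -- assumed column name
            Feature_Type TEXT NOT NULL,
            Ref_ID       INTEGER REFERENCES Genetic_Feature(Feature_ID),  -- self-pointing FK
            Start_Pos    INTEGER,              -- assumed column name
            Stop_Pos     INTEGER               -- assumed column name
        );
        CREATE TABLE Gene (
            -- the subtype's primary key doubles as a foreign key to the parent table
            Gene_ID     INTEGER PRIMARY KEY REFERENCES Genetic_Feature(Feature_ID),
            Gene_Symbol TEXT
        );
        """
    )
    return conn


def load_example(conn: sqlite3.Connection) -> None:
    """Insert an assembly (assumed Feature_ID 1) and a gene mapped onto it."""
    conn.executemany(
        "INSERT INTO Genetic_Feature VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "Assembly", "Assembly", None, None, None),  # root: reference is null
            (15, "EXAMPLE", "Gene", 1, 120, 800),           # gene referenced to the assembly
        ],
    )
    conn.execute("INSERT INTO Gene VALUES (15, 'EXAMPLE')")


def gene_extent(conn: sqlite3.Connection, symbol: str):
    """Return (Ref_ID, Start_Pos, Stop_Pos) for a gene by joining subtype to parent."""
    return conn.execute(
        """
        SELECT f.Ref_ID, f.Start_Pos, f.Stop_Pos
        FROM Gene g JOIN Genetic_Feature f ON f.Feature_ID = g.Gene_ID
        WHERE g.Gene_Symbol = ?
        """,
        (symbol,),
    ).fetchone()
```

Because every positional fact lives in the single parent table, a query about any feature type joins against Genetic_Feature in the same way; the subtype table only carries type-specific columns.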
[0193] This centralized organization of the positional relationships of various genetic features through one parent
table is believed to be novel and offers significant advantages over known database designs by reducing the cost of
maintaining the database and increasing the efficiency of querying the database. In addition, organization of genetic
features by this novel relative positional referencing approach allows this information to readily be organized into
genomic sequences, gene and gene transcript structures and also into diagrams mapping genetic features to the
assembled genomic and gene sequences. The design and use of the genomic repository submodel are described in more
detail below.

[0194] The most important genetic features are defined below, with the names of the tables containing information
specific to each genetic feature indicated in parentheses if different.

Genome: The ultimate root feature for all genetic features. Its reference link is always null, i.e., it is itself not mapped
to anything. As long as there is not a complete genomic sequence, there is little reason to actually have a table
for this.

Chromosome: The highest unit of contiguous genomic sequence. The reference for chromosomes would be the
genome. Because there is no overlap between chromosomes, the genome is a disjoint assembly of all the
chromosomes, in a particular order, with gaps between all neighboring chromosomes.

Assembly (Seq_Assembly): An assembly is defined as a set of one or more contigs, ordered in a certain way.
In the absence of genome or chromosome features, the assembly will be the root of the genomic sequence mapping
tree. Its reference is then null.

Contig: A contiguous assembly of overlapping sequences that are ordered 5' to 3'. A contig is preferably referenced
to its assembly.

Unordered Contig: A collection of contiguous sequences that are not ordered and may or may not have gaps
between them. An unordered contig, which is represented by an external accession number, is broken down and
used in building the sequence assembly as a normal contig.

Sequence (Genetic_Accession): A stretch of nucleotide sequence data. This data is represented by a unique
accession number and a version number. Sequence data can include YACs, BACs, gene sequences and ESTs.
Typically, the source of sequence data will be GenBank and other sequence databases, but any piece of sequence
is allowed. A sequence is normally referenced to its contig.

Gap: The gap is a zero-length feature which indicates that there is an unknown amount of additional sequence to
be inserted at this point. It is merely an indication of lack of knowledge and has no physical counterpart. Gaps are
usually referenced to the assembly in which they separate the contigs. They would also be used with the genome
as reference to separate the chromosomes.

Gene: This defines the gene locus in terms of base pairs. The start and stop positions of the gene are not usually
well defined. A gene starts somewhere between the end of the previous gene and the beginning of the first
recognized promoter element. A gene ends somewhere between the end of the last exon and the beginning of the
next gene. In practice, including at least four kilobase pairs of promoter region is desirable. A gene is preferably
referenced to an assembly.

Gene Region: A particular region of the gene. Gene regions are classified according to their transcriptional or
translational roles. For a gene sequence, there are promoters, introns and exons. In a transcribed sequence,
different gene regions include 5' and 3' untranslated regions (UTRs) as well as protein-coding regions.

Polymorphism: A part of the genome that is polymorphic across different individuals in a population. The most
common polymorphisms are SNPs, the length of which is one base pair. All polymorphisms are preferably
referenced to the sequence with respect to which they were found.

Primer: A short region of about 20 base pairs corresponding to an oligonucleotide for priming PCR reactions and/
or primer extension reactions in a variety of polymorphism detection assays. Primers are preferably referenced to
the sequence they were designed from.

Transcript: The result of a splice operation of the gene sequence. There can be several transcripts per gene, to
indicate splice variants. The transcript is mapped to genetic features via the Splice table, but does not map to
anything the conventional way, i.e., its reference is always null. The transcript starts another branch of positional
mapping of genetic features related to protein sequences.

[0195] While the above definitions set forth the preferred reference for certain kinds of genetic features (for example,
polymorphisms should be referenced to sequences), it is important to realize that the schema design allows the
reference for any particular genetic feature to be flexible, and the reference may be changed as circumstances warrant.
Whenever the user asks for a start or stop position, he should ask "what is the position of X relative to Y", rather than
"what is the position of X", which is an ambiguous question. The correct question can be answered with a simple tree
traversal routine. The answer will not depend on which genetic feature serves as the direct reference for X.
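
The tree traversal mentioned in paragraph [0195] might be sketched as follows. This is an illustrative routine under stated assumptions, not the routine used in the described system: the Feature tuple and the function name position_relative_to are hypothetical, Y is assumed to be an ancestor of X in the reference tree, positions are 1-based, and reverse-oriented features (stop less than start) are deliberately not handled. The assembly's Feature_ID of 1 is assumed; the gene and exon coordinates are taken from the EXAMPLE gene in Table B.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    ref_id: Optional[int]  # Feature_ID of the reference; None for a root feature
    start: Optional[int]   # start position relative to the reference
    stop: Optional[int]    # stop position relative to the reference


def position_relative_to(features: Dict[int, Feature], x_id: int, pos: int, y_id: int) -> int:
    """Map 1-based position `pos` inside feature x_id into the coordinates of ancestor y_id."""
    current = x_id
    while current != y_id:
        feat = features[current]
        if feat.ref_id is None:
            raise ValueError(f"feature {y_id} is not an ancestor of feature {x_id}")
        if feat.stop < feat.start:
            raise NotImplementedError("reverse-oriented features are outside this sketch")
        # Position 1 of the child coincides with the child's start in its reference.
        pos = feat.start + pos - 1
        current = feat.ref_id
    return pos


# Assembly (assumed Feature_ID 1), EXAMPLE gene (15) and Exon 1 (17).
FEATURES = {
    1: Feature(None, None, None),
    15: Feature(1, 120, 800),
    17: Feature(15, 180, 280),
}

# The same exon referenced directly to the assembly instead of to the gene.
REREFERENCED = {
    1: Feature(None, None, None),
    17: Feature(1, 299, 399),
}
```

With these inputs, the first base of Exon 1 is at position 180 relative to the gene and at position 299 relative to the assembly, and the second dictionary gives the same assembly position even though the exon's direct reference differs.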

[0196] All start and stop positions are preferably given in nucleotide positions, even for protein features. This retains
the uniformity of the mapping scheme, and the translation to amino acid positions is trivial. The first position in a
sequence has the position 1. The stop position is one more than the position of the last base, such that
length = abs(stop - start). The stop position can be less than the start position, in which case a reverse complement needs to be
taken on the reference sequence to get the feature sequence. However, in another embodiment, a different physical
map could be generated that would be expressed in something other than base pair positions, e.g., centimorgans.
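
The coordinate convention of paragraph [0196] can be made concrete with a short sketch. The function names are hypothetical. For the reverse-oriented case the text does not spell out exactly which bases are covered, so the sketch assumes the feature spans the same half-open interval between the lower and the higher coordinate and is then reverse complemented; that choice is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_length(start: int, stop: int) -> int:
    """length = abs(stop - start), with stop one past the last base."""
    return abs(stop - start)


def feature_sequence(reference: str, start: int, stop: int) -> str:
    """Extract a feature from its reference using 1-based start and exclusive stop.

    Assumption: when stop < start, the interval [stop, start) is taken and
    reverse complemented.
    """
    low, high = min(start, stop), max(start, stop)
    segment = reference[low - 1 : high - 1]  # convert 1-based to Python slicing
    return segment if start <= stop else reverse_complement(segment)
```

For instance, under this convention a gap with start 400 and stop 400 has length zero, and Exon 1 of Table B (start 180, stop 280) has length 100.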
[0197] Another level of hierarchy could be added to the genomic repository submodel by implementing each gene
region type as its own subclass extending Gene_Region (i.e., creating separate tables for different gene region
types with the primary key linked as a foreign key to the Gene_Region table). Alternatively, the hierarchy could be
flattened by eliminating the Gene_Region object and having individual gene region types directly subclass
Genetic_Feature.

[0198] In addition, other genetic features may be added as the database develops. For example, it is contemplated
that an additional useful genetic feature is a secondary structure region of a protein, e.g., alpha-helix, beta-sheet, turn
and coil regions. For each new genetic feature, a new genetic feature type needs to be created, and a table to contain
information specific to the new genetic feature type needs to be added. Some genetic features will not have additional
information (Gap, for example), and thus no table is necessary in such cases. The primary key of the genetic feature
type specific table always needs to double as a foreign key to the Genetic_Feature table. This design enables the
database submodel to be flexible and extendable enough to accommodate the rapid evolution and increase in volume
of genomic information.

[0199] Assembly of a genomic sequence typically starts with a gene name and comprises performance of the
following steps by a human and/or computer operator:

(a) Identify sequences related to this gene by searching GenBank and/or other sequence databases.

(b) Generate contigs and alignments from the identified sequences using a commercial sequence alignment
program such as Phrap.

(c) Store the assembly, contigs, and sequences as selected by the operator in the database (see Table A).

[0200] The result of this process is one assembly made up of one or more contigs, which in turn are made
up of potentially many sequences. This is illustrated in the diagram shown in Figure 47 and in Table A below.



Table A

Feature Id   Feature Name   Feature Type   Reference   Start   Stop
             Assembly       Assembly
2            Contig 1       Contig                     1       400
3            Gap 1          Gap                        400     400
4            Contig 2       Contig                     400     750
5            Gap 2          Gap                        750     750
6            Contig 3       Contig                     750     1000
7            A2345          Sequence       2                   250
8            A3724          Sequence       2           30      180
9            M28384         Sequence       2           100     350
10           EST283729      Sequence       2           300     400
11           A2445          Sequence       4           1       250
12           M24783         Sequence       4           200     350
13           M9485          Sequence       6           1       250
14           EST374886      Sequence       6           80      220



[0201] If there is more than one contig, the assembly will be disjoint, indicating that an unknown amount of sequence
is missing in one or more places. Each such place is marked by a gap feature, which is referenced to the assembly
feature.
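
As a rough illustration of paragraph [0201], the sketch below checks that every boundary between neighboring contigs of a disjoint assembly is marked by a zero-length gap feature. The tuple layout and the function name find_unmarked_boundaries are hypothetical, and all features are assumed to be referenced to the same assembly; the coordinates are the contig and gap rows of Table A.

```python
from typing import List, Tuple

# (feature name, feature type, start, stop), all relative to one assembly
FeatureRow = Tuple[str, str, int, int]

TABLE_A_FEATURES: List[FeatureRow] = [
    ("Contig 1", "Contig", 1, 400),
    ("Gap 1", "Gap", 400, 400),
    ("Contig 2", "Contig", 400, 750),
    ("Gap 2", "Gap", 750, 750),
    ("Contig 3", "Contig", 750, 1000),
]


def find_unmarked_boundaries(features: List[FeatureRow]) -> List[Tuple[str, str]]:
    """Return pairs of neighboring contigs whose boundary has no zero-length gap."""
    contigs = sorted((f for f in features if f[1] == "Contig"), key=lambda f: f[2])
    # A gap is a zero-length feature, so its start equals its stop.
    gap_positions = {f[2] for f in features if f[1] == "Gap" and f[2] == f[3]}
    unmarked = []
    for left, right in zip(contigs, contigs[1:]):
        if left[3] not in gap_positions:  # the boundary sits at the left contig's stop
            unmarked.append((left[0], right[0]))
    return unmarked
```

Applied to the Table A rows the function returns an empty list; removing the "Gap 2" row would report the boundary between Contig 2 and Contig 3.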

[0202] The assembly may be used in conjunction with additional information on the location of gene regions, i.e.,
promoters, exons, introns and the like, to generate a gene structure. Information on gene regions may be private
or found in the public domain. Preferably, information on the gene regions is stored in the database and the gene
structure is displayed to the user. An example of how such a display would typically appear is shown in Figure 48. The
corresponding additions to Table A are shown in Table B below.



Table B

Feature Id   Feature Name   Feature Type   Reference   Start   Stop
15           EXAMPLE        Gene           1           120     800
16           Promoter       Gene Region    15          1       180
17           Exon 1         Gene Region    15          180     280
18           Intron 1       Gene Region    15          280     500
19           Exon 2         Gene Region    15          500     680


[0203] The genomic repository database submodel of the present invention also allows referencing of gene
transcripts to other genetic features. The relationship between a transcript and a genomic sequence is not a simple
start/stop mapping, but requires the concatenation of separate regions of the genomic sequence into one combined
sequence, the gene transcript. In the present submodel, this is represented by a Splice table, which provides an ordered
list of splice elements (usually exon regions) for each splice product (usually a transcript). Although the splice product
is a feature, it is not mapped to anything else, i.e., it is the root of its own mapping tree. Components of this tree can
be 5' and 3' UTRs, a protein, and features related to that protein such as secondary structure or signal sequences.
The diagram in Figure 49 shows the full mapping example down to the protein regions. The Splice table for this example
is set forth in Table C below, which incorporates the EXAMPLE information from Table B:



Table C

Splice Id   Order No   Region Id   Product Id
            1          17          20
            2          19          20



Also, Table A would have the following additions:

Feature Id   Feature Name    Feature Type   Reference   Start   Stop
20           EXAMPLE trans   Transcript
21           5' UTR          Region         20                  40
22           CETP prot       Protein        20          40      240
23           3' UTR          Region         20          240     280
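
The splice operation behind Table C can be sketched in a few lines. This is an illustrative reading of the Splice table rather than the described implementation: the function name splice_product and the data layout are hypothetical, regions are assumed to be forward-oriented and referenced to the gene, and the stop position is treated as one past the last base. The region coordinates come from Table B and the splice rows from Table C.

```python
from typing import Dict, List, Tuple

# Gene region coordinates relative to the gene: Region Id -> (start, stop)
GENE_REGIONS: Dict[int, Tuple[int, int]] = {
    17: (180, 280),  # Exon 1
    19: (500, 680),  # Exon 2
}

# Splice rows as (Order No, Region Id, Product Id)
SPLICE_ROWS: List[Tuple[int, int, int]] = [
    (1, 17, 20),
    (2, 19, 20),
]


def splice_product(
    gene_seq: str,
    regions: Dict[int, Tuple[int, int]],
    splice_rows: List[Tuple[int, int, int]],
    product_id: int,
) -> str:
    """Concatenate the regions listed for product_id in ascending Order No."""
    parts = []
    for _order_no, region_id, prod in sorted(splice_rows):
        if prod != product_id:
            continue
        start, stop = regions[region_id]
        parts.append(gene_seq[start - 1 : stop - 1])  # 1-based start, exclusive stop
    return "".join(parts)
```

With a gene sequence of 680 bases, the two exons contribute 100 and 180 bases, so the spliced product is 280 bases long, which agrees with the 3' UTR stop position of 280 listed for the transcript.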



[0204] 2. Clinical Repository (FIGURE 44B): This submodel encapsulates polymorphism and clinical information
about subjects and reference individuals used in clinical trials. The Subject_Hap table associates a given haplotype
(identified by the Hap_ID field) with each patient subject having that haplotype (identified by the Sub_ID
(Subject ID) field). Associations between polymorphisms in a locus (including SNPs and haplotypes) and different clinical
phenotypes (such as disease association and drug response) are captured by the Measure_ID and Measure_Result
fields in the Subject_Measurement table.
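
One way to picture how the Subject_Hap and Subject_Measurement tables support haplotype-phenotype association is the join sketched below. It uses Python's sqlite3 module with invented data; only the field names Hap_ID, Sub_ID, Measure_ID and Measure_Result come from the text, while the table layouts, the numeric result type and the function names are assumptions.

```python
import sqlite3


def build_clinical_tables() -> sqlite3.Connection:
    """Create toy Subject_Hap and Subject_Measurement tables with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Subject_Hap (Sub_ID INTEGER, Hap_ID INTEGER,
                                  PRIMARY KEY (Sub_ID, Hap_ID));
        CREATE TABLE Subject_Measurement (Sub_ID INTEGER, Measure_ID INTEGER,
                                          Measure_Result REAL);
        """
    )
    # Hypothetical subjects 1-3 carrying hypothetical haplotypes 101 and 102.
    conn.executemany(
        "INSERT INTO Subject_Hap VALUES (?, ?)",
        [(1, 101), (1, 102), (2, 101), (3, 102)],
    )
    # Hypothetical results for one clinical measurement (Measure_ID 7).
    conn.executemany(
        "INSERT INTO Subject_Measurement VALUES (?, ?, ?)",
        [(1, 7, 10.0), (2, 7, 20.0), (3, 7, 30.0)],
    )
    return conn


def mean_result_by_haplotype(conn: sqlite3.Connection, measure_id: int) -> dict:
    """Average a measurement over the subjects carrying each haplotype."""
    rows = conn.execute(
        """
        SELECT sh.Hap_ID, AVG(sm.Measure_Result)
        FROM Subject_Hap sh
        JOIN Subject_Measurement sm ON sm.Sub_ID = sh.Sub_ID
        WHERE sm.Measure_ID = ?
        GROUP BY sh.Hap_ID
        ORDER BY sh.Hap_ID
        """,
        (measure_id,),
    ).fetchall()
    return dict(rows)
```

A per-haplotype summary of this kind is only a starting point; any statistical test of association would be applied on top of the joined rows.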

[0205] 3. Variation Repository (FIGURE 44C): This submodel covers the haplotypes and the polymorphisms
associated with genes and patient cohorts used in clinical trial studies. Polymorphisms may include SNPs, small
insertions/deletions, large insertions/deletions, repeats, frame shifts and alternative splicing. The Haplotype table has the
basic fields of Hap_ID, Hap_Locus_ID and Hap_Name that identify a unique haplotype of a given gene or locus. A
haplotype is further defined by the set of SNPs that it comprises, which are listed in the Hap_SNP table. This association
table uses data fields named Hap_ID (haplotype ID) and Poly_ID (polymorphism ID) to allow the mapping of the
many-to-many relationship between a haplotype and the polymorphism(s) that constitute the specific haplotype. The haplotype
and SNP information may be used in clinical trial and drug assay studies. Data from such studies are stored in the
clinical repository and drug repository submodels.
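
The many-to-many mapping carried by the Hap_SNP association table can be sketched as follows, again with sqlite3 and invented rows. The Haplotype fields and the Hap_ID/Poly_ID pair are named in the text; the minimal Polymorphism table, the composite primary key and the helper function names are assumptions made for the illustration.

```python
import sqlite3


def build_variation_tables() -> sqlite3.Connection:
    """Create toy Haplotype, Polymorphism and Hap_SNP tables with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Haplotype (Hap_ID INTEGER PRIMARY KEY,
                                Hap_Locus_ID INTEGER, Hap_Name TEXT);
        CREATE TABLE Polymorphism (Poly_ID INTEGER PRIMARY KEY);
        CREATE TABLE Hap_SNP (
            Hap_ID  INTEGER REFERENCES Haplotype(Hap_ID),
            Poly_ID INTEGER REFERENCES Polymorphism(Poly_ID),
            PRIMARY KEY (Hap_ID, Poly_ID)  -- one row per haplotype/SNP pairing
        );
        """
    )
    conn.executemany("INSERT INTO Haplotype VALUES (?, ?, ?)",
                     [(1, 10, "Hap A"), (2, 10, "Hap B")])
    conn.executemany("INSERT INTO Polymorphism VALUES (?)", [(501,), (502,), (503,)])
    conn.executemany("INSERT INTO Hap_SNP VALUES (?, ?)",
                     [(1, 501), (1, 502), (2, 502), (2, 503)])
    return conn


def snps_of(conn: sqlite3.Connection, hap_id: int) -> list:
    """List the Poly_IDs that make up one haplotype."""
    rows = conn.execute(
        "SELECT Poly_ID FROM Hap_SNP WHERE Hap_ID = ? ORDER BY Poly_ID", (hap_id,))
    return [r[0] for r in rows]


def haplotypes_with(conn: sqlite3.Connection, poly_id: int) -> list:
    """List the names of haplotypes that include a given polymorphism."""
    rows = conn.execute(
        """
        SELECT h.Hap_Name FROM Hap_SNP hs
        JOIN Haplotype h ON h.Hap_ID = hs.Hap_ID
        WHERE hs.Poly_ID = ? ORDER BY h.Hap_Name
        """,
        (poly_id,),
    )
    return [r[0] for r in rows]
```

The association table can be read in both directions: from a haplotype to its constituent SNPs, and from a SNP to every haplotype that contains it.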

[0206] 4. Literature Repository (FIGURE 44D): This submodel enables annotation of the genetic features in the
genomic repository and the variation information in the variation repository with public domain information relating to
these objects. Annotation information useful in the invention may be found in peer-reviewed scientific publications and
patent documents, or by searching on-line electronic databases. The annotated objects and
their referencing information are linked through the various association tables.

[0207] 5. Drug Repository (FIGURE 44E): This submodel captures client companies, contact information,
compounds used in different disease areas and assay results for such compounds in regard to polymorphisms and
haplotypes of target genes. Associations between polymorphisms in a drug target and activity of a candidate drug are
captured by the following data fields: Hap_ID (Hap_Locus table); Compound_ID (Compound table); and Assay_ID
(Assay, Assay_Experiment, and Assay_Result tables).

b. Abbreviations

[0208] The following abbreviations are used extensively in the data model described herein below, both in the table
schema and in the diagram drawings shown in FIGURES 44A-E.

• AA: amino acid
• Clin: clinical
• Descr: description
• FK: foreign key
• Geo: geographical
• HAP: haplotype
• ID: identifier
• Info: information
• Loc: location
• Med: medical
• Mol: molecule
• NT: nucleotide
• PK: primary key
• Poly: polymorphism
• Pos: position
• Pub: publication
• QC: quality control
• Seq: sequence
• SNP: single nucleotide polymorphism
• Sub: subject
• Therap: therapeutic

c. Tables 

[0209] This preferred embodiment of a database of the present invention contains 83 tables as follows: 

1) Alignment_Component
2) Allele
3) Assay
4) Assay_Experiment
5) Assay_Result
6) Assembly_Component
7) Chromosome
8) Clasper_Clone
9) Class_System
10) Client_Genes
11) Clinical_Site
12) Clinical_Trial
13) Cohort
14) Company
15) Company_Address
16) Compound
17) Contact
18) Contig
19) Discovery_Method
20) Disease_Susceptibility
21) Drug
22) Drug_Target
23) Electronic_Material
24) Family
25) Feature_Info
26) Feature_Literature
27) Gene
28) Gene_Alias
29) Gene_Class
30) Gene_Hap_Locus
31) Gene_Map_Location
32) Gene_Nomenclature
33) Gene_Pathway
34) Gene_Region
35) Gene_Transcript
36) Genetic_Accession
37) Genetic_Feature
38) Genome_Map
39) Genomic_Region
40) Geo_Ethnicity
41) Hap_Allele
42) Hap_Confirmation
43) Hap_Locus
44) Hap_Locus_Poly
45) Hap_Locus_Subject
46) Haplotype
47) Ind_Geo_Ethnicity
48) Ind_Medical_History
49) Individual
50) Literature
51) Locus_Accession
52) Med_Thesaurus
53) Patent
54) Patent_Full_Text
55) Pathway
56) Pathway_Literature
57) Poly_Confirmation
58) Poly_Patent
59) Poly_Pub
60) Polymorphism
61) Project
62) Project_Gene
63) Protein
64) Publication
65) Seq_Accession
66) Seq_Assembly
67) Seq_Text
68) Species
69) Splice
70) Subject
71) Subject_Cohort
72) Subject_Hap
73) Subject_Measurement
74) Subject_Poly
75) Therap_Drug
76) Therapeutic_Area
77) Therapeutic_Gene
78) Transcript_Region
79) Trial_Cohort
80) Trial_Drug
81) Trial_Measurement
82) Unordered_Contig
83) URL

d. Fields 

[0210] Figures 44A-E show the fields of each of the tables in the currently used database. The following are
descriptions of the fields in the database:



Table Name: Alignment_Component

Field Name        PK    FK    Comments
Descr             No    No    free note text about the record; occurs in all tables
Weight            No    No    weight for a component to take in alignment decision making
Alignment_End     No    No    end of the alignment of the component in the contig
Alignment_Start   No    No    start of the alignment of the component in the contig
Segment_List      No    No    the actual consensus alignment text with gaps
Component_ID      No    Yes   component used in the alignment
Order_Num         Yes   No    order of the component in the alignment
Contig_ID         Yes   Yes   contig constructed by the alignment

Relationship Explanation:
An Alignment_Component is associated with exactly one Contig.
An Alignment_Component is associated with exactly one Genetic_Feature.

Table Name: Allele

Field Name        PK    FK    Comments
Descr             No    No
AA_Seq_Text       No    No    amino acid sequence for the allele
Codon_Seq_Text    No    No    codon sequence
NT_Seq_Text       No    No    nucleotide sequence
Allele_Name       No    No    descriptive name
Poly_ID           Yes   Yes   id of the polymorphism
Allele_Code       Yes   No    name that reveals the allele, usually the same as NT_Seq_Text

Relationship Explanation:
A Hap_Allele is associated with one to many Allele.
A Subject_Poly is associated with exactly one Allele.
An Allele is associated with exactly one Polymorphism.

Table Name: Assay

Field Name        PK    FK    Comments
Descr             No    No
Assay_Type        No    No
Assay_ID          Yes   No    id for an assay
Assay_Name        No    No    descriptive name

Relationship Explanation:
An Assay_Experiment is associated with exactly one Assay.

Table Name: Assay_Experiment

Field Name        PK    FK    Comments
Descr             No    No
Exp_Date          No    No    date of experiment
Operator          No    No
Exp_Parameters    No    No    parameters used in the experiment
Assay_ID          No    Yes   the assay where the experiment belongs
Exp_ID            Yes   No    id for an experiment

Relationship Explanation:
An Assay_Result is associated with exactly one Assay_Experiment.
An Assay_Experiment is associated with exactly one Assay.

Assay_Result Descr No No

QC No No quality control of the experiment

Assay_ResuU No No free text of the assay result 

HapJD Yes Yes HAP in study 

ProteinJD Yes Yes protein in studytEVO 

CompoundJD Yes Yes compound in study 

Exp_U> Yes Yes the experiment 

CloneJD Yes Yes clone involved 



Assembly. Coniponent_ID No Yes component used in the assembly 
Component 

Desa No No 

Order_Num Yes No order ofthc component in the assembly 



Assembly_ID Yes Yes id for the assembly 

Cbromo- Descr No No 
some 

Chromosome No No descriptive name 
Name 

Spccies_ID No Yes the species of tfw genome 

Qin>mosome_ Yes Yes id for a chromosome 



An AssayResult is 

associated with exactly one 

Clasper_Clone. 

An Assay_Result is 

associated with exactly one 

Assay_Experiment 

An Assay_Result is 

associated with exactly one 

Compound. 

An Assay_Resultis 

associated with exactly one 



An Assembly_Componcnt is 

associated with exactly one 

Seq_Assembly. 

An Assembly_Component is. 

associated with zero or one 

Genetic Feature. 



A Gene_Map_Location is 

associated widi exactly one 

Chromosome. 

A Cene.Nomendature is 



Chromosome. 

A Chromosome is associated 

with exactly one 

Genelic_Feature. 

A Oiromosome is associated 

with zero or one Spedes. 



Clasper_ Clone_ID Yes No id for a clone 
Gone 

HapJD Yes Yes HAP die clone n; 

Desci- No No 

SubJD No. Yes the individual from which the clone is 



AnAssay_Resultis 
associated with exactly one 
Clasper_aone. 
Aaasper_Cloneis 
associ^uted with zero or one 
Subjects. 

A CIasper_Ctone is 
associated with exactly one 
Haplotype. 



Class_ 
System 



Path_Name No No the specific path a class is defined 

Descr No No 

Class_Name No No descriptive name 

Node_Level No No level at which the class is located 

Super_ID No No the parent of the cuirent class 

Class_ID Yes No id for a class 



A Gene_Class is associated
with exactly one
Class_System.
Class_System No No the system used to define the class





Security_Code 


No No security level of the request 






Descr 


No No 






Reqiiest_Oider 


No No the physical order of the request 






Coinpany_ID 


Yes Yes id for company that makes the request 


A Client Genes is associated 








with exactly one Gene. 




Genc_ID 


Yes Yes id of the gene 


A Clicnt_Genes is associated 








widi exactly one Company. 


Clinical 


Descr 


No No 




Site 


Company_ID 


No Yes 






Site Name 


No No descriptive name 






Cliiiical_Site_ 


Yes No A Clinical_Sile R/41 at least one Subject 


A Subject is associated with 




ID 




exacdy one ainical_Site. 








A Ginical_Siie is associated 








with exactly one Company. 


Clinical_ 


Descr 


No No 


AClinical_Trialis 


Trial 






associated with one to many 








Trial_Drug. 




Therap_lD 


No Yes id for the dierapeutic area 


A Clinical_Trial is 








associated with one to many 








TriaI_Cohort 




Start_Date 


No No when the trial started 


A CIinical_Tria] is 








associated \^th one to many 








Trial_Measurement. 




Trial_ID 


Yes No id 


A Tria!_Drug is associated 








with exactly one to many 








Clinical_Trial. 




Trial_Code 


No No code for identification purpose 


A Trial_Cohort is associated 








with exactly one 








ainieal_Trial. 




Trial_Name 


No No descriptive name 


A Tnal_Measurement is 








associated with exactly one 








Clinical_TtiaL 








ACIinial_Trialis 








associated wifli one 








Therapeutic Area. 


cohort 


Descr 


No No 


A Cohort is associated with 








one to many 1>ial_Cohort 




Cohott_Namc 


No No descriptive name 


A Cohort is associated with 








one to many Subject_Cobort. 




Cohoit_ID 


Yes No id 


A Trial_Cohoit is associated 








with exactly one Cohort 




Coin|>any_ID 


No Yes company who owns die trial 


ASubject_Cohortis 








associated with exactly one 








Cohort. 








A Cohort is associated with 








exactly one Company. 



A Compound is associated
with exactly one Company.
A Company_Address is
associated with exactly one
Company.

A Clinical_Site is associated
with exactly one Company.
A Client_Genes is associated
with exactly one Company.
A Cohort is associated with
exactly one Company.
Company_Name No No descriptive name

Company_ID Yes No id 



DesCT 


No No 


Web_Site 


No No 


Zip 


No No 


Country 


No No 


State 


No No 


aty 


No No 




No No 


Address_ID 


Yes No 



CDnipany_ID Yes Yes 



A Patent is associated with 

one Company. 

A Drug is associated with 

exactly one Company. 

A Company is associated 

with one to many 

Compound. 

A Company is associated 

with one to many 

Company_Address. 

A Company is associated 

with one to many 

Ch'nical_Site. 

A Company is associated 

with one to many 

CIient_Gene. 

A Company is associated 

with one to many Cohort 

A Company is associated 

with one to many Patent 

A Company is associated 

with one to many Drug. 



A Company_Address is 
associated with one to many 
Conuct 

A Contact is associated with 
zero or one 
Company_Address. 
A Conipany_Address is 
associated with exactly one 
Company. 



Conqwund Coinpound_ No No descripiivename 
Name 

Structure No No a handler for accessing the structure info 
Handler " 

Descr No No 

CompanyJD No Yes company who owns the compound 



Registraiion_ No No registration number of the compound 
Campound_ID Yes No id 

PatentJD No Yes patent on the compound 



Contact OfRce_Phone No No 
Email.Address No No 
CdLPhone No No 



A Compound is associated 
with one to many 
Assay_ResulL 
A Compound is associated 
with one to many Drug. 
An Assay_Resuk is 
associated widi exactly one 
Compound. 

A Drug is associated with 



A Compound is associated
with zero or one Patent.
A Compound is associated
with exactly one Company.
FAX

Web_Site 
Descr 

Pager_Phone 



ContactJD 

Company_ID 
AddressJD 
Last_Name 
MiddIe_Naine 



No No 

No No 

No No 

No No 

No No 

Yes No 

No Yes 

No Yes 

No No 

No No 



A Contact is associated with 
zero or one 
Company_Address. 



Fiist_Name No No 



I No a contig is a continuous piece of DNA 



Contig_Name No No descriptive na 
Contig.lD Yes Yes id 



A Contig is associated with 
one to many 
Alignn)ent_Component 
A Alignment_C(>mponent is 
associated with exactly one 
Contig. 

A Contig is associated with 
exactly one Genetic Feature. 



Method_Name No No descriptive name 
MethodJD Yes No id 



A Discovery_Method is 
associated with one to many 
Hap_Confinnation. 
A Discovery_Melhod is 
associated with one to many 
Poly_Confinnation. 
A Hap_Confirmation is 
associated with zero or one 
Discovery Method. 
APoly_Cc " 



I Yes polymorphism in study 



Discasc_ Poly_ID 
Suscepti^ 
bility 

Ethnic_Code Yes Yes ethnic group code 
Therap_ID Yes Yes therapeutic area in sftidy 



Descr No No 

Hap.lD No Yes HAP in study 

Susceptibility No No measurement of susceptibility 



Compoun 


dJD No Yes beingacompou 


indwiifaanID 


Stage** 


ient_ No No stoge 




Side_Efre 


cts No No 




Toxicity 


No No 




Administr 


ation_ No No 




Route 






Descr 


No No 





A Disease_SusceptibiIiiy is 

associated with zero or one 

Polymorphism. 

A Discase_Susceptibility is 

associated with exactly one 

Therapeutic_Area. 

A Disease_Suscepttbility is 

associated with exactly one 

Geo_Ethnicity. 

A Disease Susceptibility is 

associated with zero or one 

Haplotype. 



A Drug is associated with
one to many Trial_Drug.
Dosage


No No 


A Drug is associated with 








one to many Drug^Target. 




ProteinJD 


No Yes protein ID ifdrug is a protein 


A Drug is associated with 








one to many Therap_Dmg. 




DrugJD 


Yes No id 


A Trial Drug is associated 








with exactly one Drug. 




Common Name No No 


A Diiig_Target is associated 








with exactly one Drug. 




Scientific, 


No No 


A Therap_Drug is associated 




Name 




with exactly one Dmg. 




Generic_Name 


No No 


A Drug is associated with 








zero or one Protein. 




Dni&.aass 


No No classification of the drug 


A Drug is associated with 








zero or one Compound. 




Coinpany_ID 


No Yes company who owns the drug 


A Drug is associated with 








exactly one Company. 


Dnig_ 








Target 










Gene_n> 


Yes Yes the gene that the drug works on 


A Drug Target is associated 






with exactly one Drug. 




Dni£.lD 


Yes Yes drug in study 


A Dnjg_Target is associated 








with exactly one Gene. 


Electronic. 


. Receive_Date 


No No captures the refenncing material 








distributed electronically 






Descr 


No No 






Title 


No No 






Contents 


No No 






Email_Address 


No No 






lnfo_Source 


No No 






In&_ID 


Yes Yes 


An EIectronic_Malerial is 








associated with exactly one 
Literature. 




DaU_Type 


No No 






Authors 


No No 




Family 


Descr 


No No 






Generation.!^ No No number of generation into the ancestiy 






Mother 


No Yes 






Father 


No Yes 


1 ■ assocated with 








exactly one Individual. 




Fami1y_ID 


Yes No id 


A Family is associated with 








exactly one Individual. 


Feature 


Descr 


No No 




Info 










DetaiI_Valuc 


No No feature info value 






Feature. 


Yes No feature info category. 






Qualifier 








Feature_ID 


Yes Yes 


A Feature.Info is associated 








witfi exactly one 








Genetic Feature. 


Feature. 


Descr 


No No feature to literature association 




Literature 


Literature_ID 


Yes Yes 


A Feature_Literature is 






associated with exactly one 








Genetic.Feature. 




Feature.IO 


Yes Yes 


A Feature.Literatuie is 








associated with exactly one 








Literature. 


Gene 






A Gene_Map_Location is
associated with exactly one

Gene_Symbol No Yes standard symbol

Oescr No No 

SpeciesJD No Yes species in which the gene is located 

Cene_ID Yes Yes id 



A Client_Genes is associated 
with exactly one Gene. 
A Seq_Gene_Location is 
associated with exactly one 
Gene. 

A Feature_Gene_Location is 
associated with exactly one 
Gene. 

A Therapeutic_Gene is 
associated with exactly one 
Gene. 

A Gene_Pathway is 
associated with exactly one 
Gene. 

A Drug_Tar;get is associated 
with exactly one Gene. 
A Cene_Class is associated 
widi exactly one Gene. 
A Patent is associated with 
zero or one Gene. 
A Project_Gene is associated 
with exactly one Gene. 
A Gcnc_Hap_Locus is 
associated with exactly one 
Gene. 

A Gene_Transcript is 
associated with zero or one 

A Gene_Region is associated 
with exactly one Gene. 
A Gene_Alias is associated 
with exactly one Gene. 
A Protein is associated with 
exactly one Gene. 
A Gene is associated with 
one to many 
Gene_Map_Location. 
A Gene is associated with 
one to many Client_Gene. 
A Gene is associated widi 
one to many 
Seq_Gene_Location. 
A Gene is associated with 
one to many 
Featute_Gene_Location. 
A Gene is associated with 
one to many 
Therapeutic_Gene. 
A Gene is associated widi 
one to many Gene_Pathway. 
A Gene is associated widi 
one to many Drug;_Target 
A Gene is associated with 
one to many Gene_Class. 
A Gene is associated with 
one to many Patent. 
A Gene is associated widi 
one to many Project_Genc. 
A Gene is associated with 
one to many 
Gene_Hap_Locus. 
A Gene is associated with
one to many
Gene_Transcript.
A Gene is associated with
one to many Gene_Alias.
A Gene is associated with
one to at least one Protein.
A Gene is associated with
exactly one Species.
A Gene is associated with
exactly one Genetic_Feature.
A Gene is associated with
exactly one Species.
A Gene is associated with
exactly one
Gene_Nomenclature.



Gene_Alias Descr No No
Gene_ID No Yes
Alias_Name No No descriptive name
Gene_Alias_ID Yes No id



A Gene_Alias is associated 
with exactly one Gene. 
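
As an illustration only, the Gene and Gene_Alias entries above could be realized as relational tables roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module. It assumes that the two Yes/No flags in each row mark primary-key and foreign-key columns; the column types, the simplified Gene table, and the sample values (GENE1, G1-alt) are hypothetical and are not taken from the tables.

```python
import sqlite3

# Hypothetical column types: the source tables list only column names and key flags.
SCHEMA = """
CREATE TABLE Gene (
    Gene_ID     INTEGER PRIMARY KEY,
    Gene_Symbol TEXT,
    Descr       TEXT
);
CREATE TABLE Gene_Alias (
    Gene_Alias_ID INTEGER PRIMARY KEY,
    Gene_ID       INTEGER NOT NULL REFERENCES Gene(Gene_ID),
    Alias_Name    TEXT,
    Descr         TEXT
);
"""


def open_database() -> sqlite3.Connection:
    """Create an in-memory database holding the two sketch tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def aliases_for(conn: sqlite3.Connection, gene_id: int) -> list:
    """Return the alias names recorded for one gene."""
    rows = conn.execute(
        "SELECT Alias_Name FROM Gene_Alias WHERE Gene_ID = ? ORDER BY Gene_Alias_ID",
        (gene_id,),
    )
    return [name for (name,) in rows]


def orphan_alias_rejected(conn: sqlite3.Connection) -> bool:
    """Check that an alias pointing at a non-existent gene is refused."""
    try:
        conn.execute(
            "INSERT INTO Gene_Alias (Gene_ID, Alias_Name) VALUES (?, ?)",
            (999, "orphan"),
        )
    except sqlite3.IntegrityError:
        return True
    return False


conn = open_database()
conn.execute("INSERT INTO Gene (Gene_ID, Gene_Symbol) VALUES (1, 'GENE1')")
conn.execute("INSERT INTO Gene_Alias (Gene_ID, Alias_Name) VALUES (1, 'G1-alt')")
```

The NOT NULL foreign key is one way to express "a Gene_Alias is associated with exactly one Gene": every alias row must name a gene, and that gene must exist.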



Gene_ Descr
Class

Class_ID

No No
Yes Yes gene classification
Yes Yes

A Gene_Class is associated
with exactly one Gene.
A Gene_Class is associated
with exactly one
Class System.



Gene_Hap Descr 

Hap_Locus_ID 



No No HAP association to the gene 
Yes Yes 



A Gene_Hap_Locus is 
associated with exactly one 
Gene. 

A Gene_Hap_Locus is 
associated with exactly one
Hap_Locus.



Gene_Map Map_Location 
_Location 

Descr 



No No location of the gene in the genome 
No No 

No Yes the chromosome

Yes Yes id of the map
Yes Yes gene 



A Gene_Map_Location is 
associated with exactly one 
Gene. 

A Gene_Map_Location is 

associated with exactly one 

Chromosome. 

A Gene_Map_Location is 

associated with exactly one 

Genome_Map.



Gene_Nomenclature Chromosome_ID No Yes the standard literature for the gene
Descr No No
Cyto_Location No No cytological location of gene
Gene_Description No No
Gene_Name No No descriptive name



A Gene_Nomenclature is

Gene_Nomenclature.
A Gene_Nomenclature is
associated with zero or one

A Gene_Nomenclature
exactly 1 Gene.

Gene_Symbol Yes No standard symbol
Most_Current No No version management of the re 



Locus_ID No No id 



A Gene is associated with 
exactly one 
Gene_Nomenclature. 



Gene_ Descr No No 
Pathway 

Gene_ID Yes Yes 

Pathway_ID Yes Yes biological pathway 



A Gene_Pathway is 
associated with exactly one 
Pathway.

A Gene_Pathway is
associated with exactly one 
Gene. 



Gene_ Region_Type No No 
Region 



Region_Name No No descriptive name 



Descr 
Gene_ID 



No No 

No Yes gene it belongs to 



Region_ID Yes Yes id



A Gene_Region is associated 
with one to many 
Polymorphism. 
A Polymorphism is
associated with zero or one
Gene_Region.
A Genomic_Region is
associated with exactly one
Gene_Region.
A Transcript_Region is
associated with exactly one
Gene_Region.
A Gene_Region is associated
with one to many
Genomic_Region.
A Gene_Region is associated
with one to many
Transcript_Region.
A Gene_Region is associated
with exactly one
Genetic_Feature.
A Gene_Region is associated



Gene_Transcript Descr No No
Transcript_Name No No descriptive name
Gene_ID No Yes gene it belongs to
Transcript_ID Yes Yes id

A Gene_Transcript is
associated with one to many
Splice.
A Gene_Transcript is
associated with one to many
Transcript_Region.
A Splice is associated with
exactly one
Gene_Transcript.
A Transcript_Region is
associated with exactly one
Gene_Transcript.
A Gene_Transcript is
associated with exactly one
Genetic_Feature.
A Gene_Transcript is
associated with zero or one
Gene.


Genetic_Accession Mol_Type No No molecular type of the record
URL_ID No Yes the URL address on the web
Source_Name No No
Descr No No
Accession_Code No No the actual accession code

A Genetic_Accession is
associated with zero or one

Seq_Version No No sequence version number
!SsionJD Yes Yes id 



No No GI number used in GenBank 



A Genetic_Accession is 
associated with exactly one 
Genetic_Feature. 



Most_Current No No version management of the record

Feature_Type No No type of the feature



the high level abstraction of genetic objects

A Genetic_Accession is
associated with exactly one
Genetic_Feature.

A Protein is associated with 

exactly one Genetic_Feature. 

A Chromosome is associated 

with exactly one 

Genetic_Feature.

A Feature_Literature is

associated with exactly one 

Genetic_Feature. 

A Polymorphism is 

associated with exactly one 

Genetic_Feature. 

A Gene_Region is associated 

with exactly one 

Genetic_Feature. 

A Gene is associated with 

exactly one Genetic_Feature. 

A Seq_Feature_Location is 

associated with exactly one 

Genetic_Feature.

A Feature_Gene_Location is
associated with exactly one
Genetic_Feature.



No No parent of a feature in term of positional 



No No start position of the feature in its parent

Complement No No whether on the reverse strand



with exactly one 

Genetic_Feature. 

A Gene_Transcript is
associated with exactly one
Genetic_Feature.
A Seq_Assembly is
associated with exactly one
Genetic_Feature.
A Unordered_Contig is
associated with zero or one
Genetic_Feature.
A Unordered_Contig is
associated with zero or one
Genetic_Feature.
A Unordered_Contig is
associated with exactly one
Genetic_Feature.
A Genetic_Feature is
associated with zero or one
Genetic_Feature.
An Assembly_Component is
associated with zero or one
Genetic_Feature.
An Alignment_Component
is associated with exactly
one Genetic_Feature.
A Contig is associated with
exactly one Genetic_Feature.
A Splice is associated with
exactly one Genetic_Feature.






EP 1 233 365 A2 



A Seq_Text is associated
with exactly one
Genetic_Feature.
A Genetic_Feature is
associated with one to many
Genetic_Accession.
A Genetic_Feature is
associated with one to
exactly 1 Protein.
A Genetic_Feature is
associated with one to many
Chromosome.
A Genetic_Feature is
associated with one to many
Feature_Literature.
A Genetic_Feature is
associated with one to many
Polymorphism.
A Genetic_Feature is
associated with one to many
Gene_Region.
A Genetic_Feature is



Genes. 

A Genetic_Feature is 
associated with one to at 
least one 

Seq_Feature_Location. 
A Genetic_Feature is
exactly one
Feature_Gene_Location.
A Genetic_Feature is
associated with one to many
Feature_Info.
A Genetic_Feature is
associated with one to many
Gene_Transcript.
A Genetic_Feature is
associated with one to many
Seq_Assembly.
A Genetic_Feature is
associated with one to many
Unordered_Contig.
A Genetic_Feature is
associated with one to many
Unordered_Contig.
A Genetic_Feature is

Unordered_Contig.
A Genetic_Feature is

Genetic_Feature.
A Genetic_Feature is
associated with one to many
Assembly_Component.
A Genetic_Feature is
associated with one to many
Alignment_Component.
A Genetic_Feature is
associated with one to many
Contig.
A Genetic_Feature is

Genome_ External_Key No No legendary key
Map 

Descr No No 



Map_Type No No type of the map 
Map_ID Yes No id



associated with one to many
Splice.

A Genetic_Feature is
associated with one to many
Seq_Text.

A Genetic_Feature is
associated with zero or one
Genetic_Feature.

A Genome_Map is
associated with exactly one
Species.

A Genome_Map is



Map_Name No No descriptive name 

Most_Current No No version management of the record

Species_ID No Yes species of the map



Gene_Map_Location. 
A Genome_Map is
associated with zero or one
Genome_Map.

A Gene_Map_Location is
associated with exactly one 
Genome_Map. 



Genomic_Region Descr No No gene region in terms of DNA organization
Region_ID Yes Yes id



Geo_Ethnicity Ethnic_Group No No the major ethnic group name
Descr No No
Ethnic_Name No No descriptive name
Ethnic_Code Yes No code for a specific ethnic sub-group



A Genomic_Region is
associated with exactly one
Gene_Region.

A Disease_Susceptibility is
associated with exactly one
Geo_Ethnicity.
An Ind_Geo_Ethnicity is
associated with exactly one
Geo_Ethnicity.
A Poly_Confirmation is
associated with zero or one
Geo_Ethnicity.
A Hap_Confirmation is
associated with zero or one
Geo_Ethnicity.
A Geo_Ethnicity is
associated with one to many
Disease_Susceptibility.
A Geo_Ethnicity is
associated with one to many
Ind_Geo_Ethnicity.
A Geo_Ethnicity is
associated with one to many
Poly_Confirmation.
A Geo_Ethnicity is 
associated with one to many 



Hap_Allele Descr No No
Poly_ID Yes Yes polymorphism constituting the HAP
Allele_Code Yes Yes the specific allele of that polymorphism
Hap_ID Yes Yes HAP

A Hap_Allele is associated
with exactly one Haplotype.
A Hap_Allele is associated
with exactly one Allele.
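
The Hap_Allele rows above tie each haplotype (Hap_ID) to one allele (Allele_Code) at each of its polymorphisms (Poly_ID). As a rough sketch of how such rows might be read back into haplotype strings, the following Python groups rows by Hap_ID. The row values are invented for illustration, and ordering the sites by Poly_ID is an assumption; a real system would presumably order them by position in the locus.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (Hap_ID, Poly_ID, Allele_Code) rows; every value here is made up.
HAP_ALLELE_ROWS = [
    (1, 101, "A"), (1, 102, "G"), (1, 103, "T"),
    (2, 101, "A"), (2, 102, "C"), (2, 103, "T"),
]


def haplotypes(rows: Iterable[Tuple[int, int, str]]) -> Dict[int, str]:
    """Assemble one allele string per haplotype from Hap_Allele-style rows."""
    by_hap: Dict[int, Dict[int, str]] = defaultdict(dict)
    for hap_id, poly_id, allele_code in rows:
        by_hap[hap_id][poly_id] = allele_code
    # Assumption: polymorphic sites are ordered by their Poly_ID.
    return {
        hap_id: "".join(alleles[poly_id] for poly_id in sorted(alleles))
        for hap_id, alleles in by_hap.items()
    }


def differing_sites(rows: Iterable[Tuple[int, int, str]], hap_a: int, hap_b: int) -> list:
    """List the Poly_IDs at which two haplotypes carry different alleles."""
    by_hap: Dict[int, Dict[int, str]] = defaultdict(dict)
    for hap_id, poly_id, allele_code in rows:
        by_hap[hap_id][poly_id] = allele_code
    shared = sorted(set(by_hap[hap_a]) & set(by_hap[hap_b]))
    return [p for p in shared if by_hap[hap_a][p] != by_hap[hap_b][p]]
```

With the sample rows, the two haplotypes differ only at the hypothetical polymorphism 102.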



I No sample size in the HAP study
External_Key No No legendary key
QC No No quality info
Descr No No
Name_Alias No No other names
Source_Name Yes No where reported
Hap_Locus_ID Yes Yes id
Ethnic_Code No Yes sub-group of population
Method_ID No Yes method us



A Hap_Confirmation is
associated with zero or one
Geo_Ethnicity.
A Hap_Confirmation is
associated with exactly one
Hap_Locus.
A Hap_Confirmation is
associated with zero or one
Discovery_Method.



the HAP built on a locus region 



Descr 



No No 



Hap_Locus_ No No descriptive name 
Name 

Most_Current No No version management of the re

Hap_Locus_ID Yes No id 



A Haplotype is associated
with exactly one Hap_Locus. 
A Hap_Locus_Poly is 
associated with exactly one 
Hap_Locus. 
A Gene_Hap_Locus is 
associated with exactly one 
Hap_Locus. 

A Hap_Locus_Subject is 
associated with exactly one 
Hap_Locus. 

A Hap_Locus is associated 
with zero or one Hap_Locus.
A Subject_Hap is associated
with exactly one Hap_Locus.
A Hap_Confirmation is
associated with exactly one
Hap_Locus.

A Hap_Locus is associated 

with zero or one Hap_Locus. 

A Hap_Locus is associated 

with one to many Haplotype. 

A Hap_Locus is associated 

with one to many 

Hap_Locus_Poly. 

A Hap_Locus is associated

with one to many 

Gene_Hap_Locus. 

A Hap_Locus is associated 

with one to many 

Hap_Locus_Subject 

A Hap_Locus is associated 

with one to many 

Hap_Locus. 

A Hap_Locus is associated 

with one to many 

Subject_Hap.
A Hap_Locus is associated
with one to many
Hap_Confirmation.



Hap_Locus_Poly
No No HAP to SNP
Poly_ID Yes Yes
Hap_Locus_ID Yes Yes



A Hap_Locus_Poly is 

associated with exactly one 

Hap_Locus. 

A Hap_Locus_Poly is
associated with exactly one
Polymorphism.

Hap_Locus_Subject Hap_Locus_ID Yes Yes HAP to subject association

Descr No No 



Sub_ID 



Yes Yes 



A Hap_Locus_Subject is 
associated with exactly one
Hap_Locus.

A Hap_Locus_Subject is
associated with exactly one
Subject. 



Descr No No 

Hap_Name No No descriptive name 

Hap_Locus_ID No Yes HAP locus to which this HAP belongs

Hap_ID



Yes No id 



A Subject_Hap is associated
with exactly one Haplotype.
A Hap_Allele is associated
with exactly one Haplotype.
A Disease_Susceptibility is
associated with zero or one
Haplotype.
A Clasper_Clone is
associated with exactly one 
Haplotype. 

A Haplotype is associated 

with one to many 

Subject_Hap. 

A Haplotype is associated 

with one to many 

Hap_Allele. 

A Haplotype is associated 



with one to many 

Clasper_Clone.
A Haplotype is associated
with exactly one Hap_Locus.



Ind_Geo_Ethnicity Ethnic_Code Yes Yes individual's ethnic background
Ind_ID Yes Yes
Descr No No
Genetic_Weight No No the weight of different ethnic heritage

An Ind_Geo_Ethnicity is
associated with exactly one
Individual.

An Ind_Geo_Ethnicity is
associated with exactly one
Geo_Ethnicity.



Ind_Medical_History Descr No No Medical history for an individual
Ind_ID Yes Yes
Therap_ID Yes Yes



An Ind_Medical_History is 

associated with exactly one 

Therapeutic_Area. 

An Ind_Medical_History is

associated with exactly one 

Individual. 



Individual Descr 
YOB 



No No individual info 

No No year of birth 

No No 

No No 



Species_ID No Yes possible for cross species study
Ind_Type No No
Ind_Code No No



An Ind_Geo_Ethnicity is
associated with exactly one 
Individual. 

A Family is associated with 
exactly one Individual. 
A Family is associated with 
exactly one Individual. 
An Ind_Medical_History is

Literature Descr

Image_File 



No No 
No No the large 



multimedia file for the record 



Source_Name No No 
e_Type No No 

.ID Yes No id 



No Yes URL address on the web



A Subject is associated with 
exactly one Individual. 
An Individual is associated 
with one to many 
Ind_Geo_Ethnicity.
An Individual is associated
with one to zero or one
Family.

An Individual is associated
with zero to many
Ind_Medical_History.
An Individual is associated 
with zero to one Subject. 
An Individual is associated 
with exactly one Species. 

A Patent is associated with 
exactly one Literature. 
A Publication is associated 
with exactly one Literature. 
An Electronic_Material is
associated with exactly one
Literature.
A Pathway_Literature is
associated with exactly one
Literature.

with zero or one URL.
A Literature zero to many
Patent.

A Literature is associated
with zero to many Publication.
A Literature is associated
with zero to many
Electronic_Material.
A Literature is associated 
with zero many 



Accession_Type No No the molecule type for the sequence
Descr No No
Locus_ID Yes No NCBI locus id
No No theactuali

Data_Source No No medical terminology
External_Key No No
Descr No No
Term_ID Yes No



Definition 
URL_ID



No No 
No Yes 



AMed_T 
associated with zero or one 
URL.

Medical_Term No No



Patent Institution No No patent info
Year No No
Title No No
Abstract No No
Granted_By No No
Descr No No
Patent_Claims No No
Inventors No No
Patent_ID Yes Yes
Gene_ID No Yes
Patent_Num No No
Company_ID No Yes
Patent_Type No No could be pending, approved, etc.

A Patent is associated with
zero to many Patent_Full_Text.
A Patent is associated with
zero to many Compound.
A Patent is associated with
zero to many Poly_Patent.
A Patent is associated with
zero or one Gene.
A Patent is associated with
zero or one Company.
A Patent is associated with
exactly one Literature.
A Patent_Full_Text is
associated with exactly one
Patent.
A Compound is associated
with zero or one Patent.
A Poly_Patent is associated
with exactly one Patent.




Patent_Full_Text Descr No No
Full_Text No No the full text document
Patent_ID Yes Yes

A Patent_Full_Text is
associated with exactly one
Patent.


Pathway Pathway_Name No No biological pathway info
Pathway_ID Yes No
Descr No No

A Gene_Pathway is
associated with exactly one
Pathway.
A Pathway_Literature is
associated with exactly one
Pathway.
A Pathway is associated with
one to many Gene_Pathway.
A Pathway is associated with
one to many

Pathway_Literature Descr
Pathway_ID Yes Yes
Yes Yes

A Pathway_Literature is
associated with exactly one
Literature.
A Pathway_Literature is
associated with exactly one
Pathway.


Poly_Confirmation Method_ID No Yes polymorphism confirmation info
Source_Name Yes No which data source
Name_Alias No No alias name
Poly_ID Yes Yes id
Descr No No
QC No No quality control info
External_Key No No legendary key

A Poly_Confirmation is
associated with exactly one

Sample_Size No No size of sample in discovery



Ethnic_Code No Yes ei

A Poly_Confirmation is
associated with zero or one
Discovery_Method.
A Poly_Confirmation is
associated with zero or one
Geo_Ethnicity.



Poly_Patent Descr No No polymorphism patent association
Poly_ID Yes Yes
Patent_ID Yes Yes

A Poly_Patent is associated
with exactly one Patent.
A Poly_Patent is associated
with exactly one
Polymorphism.



Poly_Pub Descr No No polymorphism publication af
Pub_ID Yes Yes

A Poly_Pub is associated
with exactly one Publication.
A Poly_Pub is associated
with exactly one
Polymorphism.



Polymorphism Mol_Consequence No No

of the polymorphism

Primer_Pair_ID No No primer used in the discovery

I flanking sequence on 3' end

ion_ID No Yes

A Subject_Poly is associated
with exactly one
Polymorphism.
A Poly_Pub is associated
with exactly one
Polymorphism.
A Polymorphism is
associated with one to many
Subject_Poly.
A Polymorphism is
associated with one to many
Poly_Pub.
A Polymorphism is
associated with exactly one
Genetic_Feature.

the region where the polymorphism locates

A Disease_Susceptibility is
associated with zero or one

I flanking sequence on 5'

Poly_Length No No length of the variation

Poly_ID Yes Yes id

Variation_Type No No type of variation

System_Name No No systematic name of the polymorphism



Polymorphism. 

A Poly_Patent is associated 

with exactly one 

Polymorphism. 

A Hap_Locus_Poly is

An Allele is associated with
exactly one Polymorphism.
A Poly_Confirmation is
associated with exactly one
Polymorphism.
A Polymorphism is 
associated with zero to many 
Disease_Susceptibility. 
A Polymorphism is 
associated with zero to many 
Poly_Patent. 
A Polymorphism 1^361 
many Hap_Locus_Poly. 
A Polymorphism is 
associated with at least one 
Allele. 

A Polymorphism is 






A Polymorphism is 
associated with zero or one 
Gene_Region.





Project Descr No No project info
Submitter No No
Project_Manager No No
Project_Name No No
Project_ID Yes No

A Project is associated with
one to many Project_Gene.
A Project_Gene is associated
with exactly one Project.



Project_Gene Descr No No project gene association
Gene_ID Yes Yes
Project_ID Yes Yes

A Project_Gene is associated
with exactly one Project.
A Project_Gene is associated
with exactly one Gene.




Protein Descr No No
Structure_Handler No No protein structure info handler
No Yes gene it belongs to
Protein_ID Yes Yes id

A Protein is associated with
zero to many Drug.
A Protein is associated with
zero to many Assay_Result.
A Drug is associated with
zero or one Protein.
An Assay_Result is
associated with exactly one

A Protein is associated with
exactly one Gene.
A Protein is associated with
exactly one Genetic_Feature.




Publication Keywords
Abstract
No No
Descr No No
Title No No
Institution No No
Year No No
Pub_ID Yes Yes
Journal No No

A Publication is associated
with zero to many Poly_Pub.
A Publication is associated
with exactly one Literature.
A Poly_Pub is associated
with exactly one Publication.


Assembly No No the consensus sequence built from alignment
Name No No
Assembly_ID Yes Yes id

A Seq_Assembly is
associated with one to many
Assembly_Component.
A Seq_Assembly is
associated with exactly one
Genetic_Feature.
An Assembly_Component is
associated with exactly one
Seq_Assembly.




Seq_Text Descr No No
Seq_Text No No the actual sequence text
Seq_ID Yes Yes id

A Seq_Text is associated
with exactly one
Genetic_Feature.




Species Alias_Name No No other names
Species_ID Yes No id
Descr No No
System_Name No No systematic name of the species
le No No common name



A Gene is associated with 
exactly one Species. 
A Genome_Map is 
associated with exactly one 
Species. 

A Gene is associated with 

exactly one Species. 

A Chromosome is associated 

with zero or one Species. 

A Individual is associated 

with exactly one Species. 

A Species is associated with 

one to many Gene. 

A Species is associated with 

zero to many Genome_Map.

A Species is associated with

one to many Gene.

A Species is associated with 



Splice Component_ID No Yes component involved in the splicing
Descr No No

Order_Num Yes No order of the component in the splicing
product

Transcript_ID Yes Yes id for the transcript

A Splice is associated with

Gene_Transcript.
A Splice is associated with
exactly one Genetic_Feature.
A Clasper_Clone is



Subject this is a subset of individual 

Descr No No 

External_Key No No

Clinical_Site_ID No Yes collection site

Sub_ID Yes Yes id



A Subject_Poly is associated
with exactly one Subject.
A Subject_Hap is associated
with exactly one Subject.
A Subject_Cohort is
associated with exactly one
Subject.

A Subject_Measurement is
associated with exactly one
Subject.

A Hap_Locus_Subject is
associated with exactly one
Subject.

A Subject is associated with
zero to many Clasper_Clone.
A Subject is associated with
zero to many Subject_Poly.
A Subject is associated with
zero to many Subject_Hap.



Subject_Cohort 



dwith 



zero to many 

Subject_Measurement.
A Subject is associated with
zero to many
Hap_Locus_Subject.
A Subject is associated with
exactly one Clinical_Site.
A Subject is associated with
exactly one Individual. 



Subject_Cohort Cohort_ID Yes Yes cohort subject association
Descr No No
Sub_ID Yes Yes

A Subject_Cohort is
associated with exactly one
Subject.
A Subject_Cohort is
associated with exactly one
Cohort.


Subject_Hap Hap_Locus_ID Yes Yes subject HAP typing info
Copy_Num Yes No identify the copy of the HAP
QC No No quality control data
Descr No No
Hap_ID No Yes id of HAP
Sub_ID Yes Yes id of subject

A Subject_Hap is associated
with exactly one Haplotype.
A Subject_Hap is associated
with exactly one Subject.
A Subject_Hap is associated
with exactly one Hap_Locus.
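
The Copy_Num column above is described only as identifying "the copy of the HAP". One plausible reading, assumed here and not stated in the table, is that a diploid subject has two Subject_Hap rows per locus, one per chromosome copy. Under that assumption, a small Python sketch can pair the rows into a haplotype pair per subject and locus; all identifiers and values below are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (Sub_ID, Hap_Locus_ID, Copy_Num, Hap_ID) rows; values are illustrative only.
SUBJECT_HAP_ROWS = [
    (7, 1, 1, 11), (7, 1, 2, 12),
    (8, 1, 1, 11), (8, 1, 2, 11),
]


def haplotype_pairs(
    rows: Iterable[Tuple[int, int, int, int]]
) -> Dict[Tuple[int, int], Tuple[int, ...]]:
    """Group Subject_Hap-style rows into one tuple of Hap_IDs per (subject, locus)."""
    copies: Dict[Tuple[int, int], Dict[int, int]] = defaultdict(dict)
    for sub_id, locus_id, copy_num, hap_id in rows:
        copies[(sub_id, locus_id)][copy_num] = hap_id
    # Order each pair by Copy_Num so the result is deterministic.
    return {
        key: tuple(by_copy[c] for c in sorted(by_copy))
        for key, by_copy in copies.items()
    }


def is_homozygous(pair: Tuple[int, ...]) -> bool:
    """True when every copy carries the same haplotype."""
    return len(set(pair)) == 1
```

In the sample rows, subject 7 carries two different haplotypes at locus 1 and subject 8 carries the same haplotype twice.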


Subject_Measurement Measure_Num Yes No subject clinical measurement
Measure_Result No No result of the measurement
Measure_ID Yes Yes id
ator No No who did it
QC No No quality control data
Measure_Date No No when it's done
Sub_ID Yes Yes subject being measured

A Subject_Measurement is
associated with exactly one

A Subject_Measurement is
associated with exactly one
Trial_Measurement.




Subject_Poly Poly_ID Yes Yes subject genotyping info
Copy_Num Yes No identify the copy of the SNP
Descr No No
Allele_Code No Yes the allele for the subject
QC No No quality control data

A Subject_Poly is associated
with exactly one Subject.
A Subject_Poly is associated
with exactly one Allele.
A Subject_Poly is associated
with exactly one
Polymorphism.




Therap_Drug Descr No No
Drug_ID Yes Yes drug info for the therapeutical area
Therap_ID Yes Yes

A Therap_Drug is associated
with exactly one
Therapeutic_Area.
A Therap_Drug is associated
with exactly one Drug.
A Therap_Drug is associated
with exactly one
Therapeutic_Area.


Therapeutic_Area Descr No No the look up table for the therapeutic areas
Related_Area No No
Therap_Area No No
Therap_ID Yes No

A Therapeutic_Gene is
associated with exactly one
Therapeutic_Area.
An Ind_Medical_History is
associated with exactly one
Therapeutic_Area.



A Disease_Susceptibility is
associated with exactly one
Therapeutic_Area.
A Clinical_Trial is
associated with zero or one
Therapeutic_Area.
A Therapeutic_Area is
associated with zero to many
Therap_Drug.
A Therapeutic_Area is
associated with zero to many
Therapeutic_Gene.
A Therapeutic_Area is
associated with zero to many
Ind_Medical_History.
A Therapeutic_Area is
associated with zero to many
Disease_Susceptibility.
A Therapeutic_Area is
associated with zero to many
Clinical_Trial.

Therapeutic_Gene Descr No No gene links to the therapeutical
Therap_ID Yes Yes



Gene_ID Yes Yes

A Therapeutic_Gene is
associated with exactly one
Therapeutic_Area.
A Therapeutic_Gene is
associated with exactly one
Gene.



Transcript_Region Descr No No
Transcript_ID No Yes link between gene region and di
Region_ID Yes Yes

Trial_Cohort Descr No No
Cohort_ID Yes Yes cohort involved in the clinical tri

A Transcript_Region is
associated with exactly one
Gene_Region.
A Transcript_Region is
associated with exactly one
Gene_Transcript.

A Trial_Cohort is associated
with exactly one
Clinical_Trial.
A Trial_Cohort is associated
with exactly one Cohort.



Trial_Drug Descr No No
Trial_ID Yes Yes drug used in the clinical trial

A Trial_Drug is associated
with exactly one Drug.
A Trial_Drug is associated
with exactly one
Clinical_Trial.



Measure_ No No t 
Details ~ 

Descr No No 

Measure_Type No No type 

Measure No No abbreviation form of then 

Abbrev ~ name 



A Trial_Measurement is 
associated with one to many 
Subject_Measurement.

Measure_ID Yes No id
Trial_ID No Yes trial in which the measurement is taken

A Subject_Measurement is
associated with exactly one

A Trial_Measurement is

Clinical_Trial.




Unordered_Contig Descr No No a table to handle the unordered sequence
Uncontig_Seq_ID No Yes the actual sequence corresponding
Uncontig_List_ID No Yes the accession in which it's reported
Uncontig_ID Yes Yes id

associated with exactly one

associated with zero or one

URL URL No No the URL address
Most_Current No No version management for the record
ID Yes No
Descr No No

A Genetic_Accession is
associated with zero or one

A Med_Thesaurus is
associated with zero or one

A URL is associated with
zero or one URL.
A Literature is associated
with zero or one URL.
A URL is associated with

A URL is associated with
zero to many
Genetic_Accession.
A URL is associated with
zero to many
Med_Thesaurus.
A URL is associated with
zero to one URL.
A URL is associated with

G. BUSINESS MODELS 



1. Hap2000 Partnership 

[0211] The haplotype and other data developed using the methods and/or tools described herein may be used in a
partnership of two or more companies (referred to herein as the Partnership) to integrate knowledge of human population
and evolutionary variation into the discovery, development and delivery of pharmaceuticals. The partners in the
partnership may be classified as pharmaceutical, biopharmaceutical, biotechnology, genomics, and/or combinatorial
chemistry companies. One of the partners, referred to herein as the HAP™ Company, will provide the other partner(s)
with the tools needed to address drug response problems that are attributable to human diversity.
[0212] The HAP™ Company will focus on identifying polymorphisms in genes and/or other loci found in a diverse
set of individuals, information on which will be stored in a database (referred to herein as the Isogenomics™ Database).
Preferably, the database is designed to store polymorphism information for at least 2000 genes and/or other loci that
are important to the pharmaceutical process. In a preferred embodiment, the polymorphisms identified are gene-specific
haplotypes and the genes chosen for analysis will be prioritized by the HAP™ Company by pharmaceutical relevance.
Analyzed genes may include, while not being limited to, known drug targets, G-coupled protein receptors, converting
enzymes, signal transduction proteins and metabolic enzymes. The database will be accessible through an informatics
computer program for epidemiological correlation and evaluation, a preferred embodiment of which is the DecoGen™
application described above.

a. Partnership Benefits

i. Isogenomics™ Database

[0213] The partners will have non-exclusive access to the Isogenomics™ Database, which contains the frequencies,
sequences and distribution of the polymorphisms, e.g., gene haplotypes, found in a diverse set of individuals, referred
to herein as the index repository, which preferably represents all the ethnogeographic groups in the world. Haplotypes
in the database preferably include polymorphisms found in the promoter, exons, exon/intron boundaries and the 5' and
3' untranslated regions. Preferably, the number of individuals examined in the index repository allows the detection of
any haplotype whose frequency is 10% or higher with a 99% certainty.
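
The 10% frequency and 99% certainty target can be turned into a rough sample-size figure. The sketch below rests on assumptions that are not stated in the text: "detection" is taken to mean observing the haplotype at least once, chromosomes are treated as independent random draws from the population, and each individual contributes two chromosomes. The function names are illustrative.

```python
import math


def min_chromosomes(freq: float, certainty: float) -> int:
    """Smallest n such that 1 - (1 - freq)**n >= certainty.

    With independent draws, the chance of missing a haplotype of frequency
    `freq` in all n chromosomes is (1 - freq)**n.
    """
    if not (0.0 < freq < 1.0 and 0.0 < certainty < 1.0):
        raise ValueError("freq and certainty must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - certainty) / math.log(1.0 - freq))


def min_individuals(freq: float, certainty: float, ploidy: int = 2) -> int:
    """Convert the chromosome count into a count of individuals."""
    return math.ceil(min_chromosomes(freq, certainty) / ploidy)
```

Under these assumptions, `min_chromosomes(0.10, 0.99)` gives 44 chromosomes, which corresponds to 22 diploid individuals. A repository meant to cover several population groups would presumably need at least that many individuals per group, and more if haplotypes must be inferred rather than observed directly.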

ii. Informatics Computer Program

[0214] The information within the Isogenomics™ Database is part of the HAP™ Company's informatics computer
program, which is accessible through an intuitive and logical user interface. The informatics program contains algorithms
for the reconstruction of relationships among gene haplotypes and is capable of abstracting biological and evolutionary
information from the Isogenomics™ Database. The informatics program is designed to analyze whether genes in the
Isogenomics™ Database are relevant to a clinical phenotype, e.g., whether they correlate with an effective, inadequate
or toxic drug response. In a preferred embodiment, the program also contains algorithms designed for detecting clinical
outcomes that are dependent upon cooperative interactions among gene products. In this embodiment, the computer
system has the capability to simulate gene interactions that are likely to cause polygenic diseases and phenotypes
such as drug response. The informatics computer program will be installed at a site selected by each partner. The
information in the Isogenomics™ Database will be of immediate use to drug discovery teams for target validation and
lead prioritization and optimization, to drug development specialists for design and interpretation of clinical trials, and
to marketing groups to address problems encountered by an approved drug in the marketplace.

iii. Cohort Haplotyping

[0215] In one preferred embodiment, partner(s) can use the genotyping and/or haplotyping capabilities of the HAP™
Company to stratify their clinical cohorts, which will enable the partner(s) to separate cohorts by drug response. For a
fixed fee per patient, the HAP™ Company will genotype and/or haplotype Phase II, Phase III, and Phase IV patient
cohorts under Good Laboratory Practice (GLP) conditions that will allow submittal of the data to clinical regulatory
authorities. Preferably, the clinical genotype and/or haplotype data is deposited within a component of the informatics
computer program that is proprietary to the partner to allow the partner to correlate polymorphisms such as gene
haplotypes with drug response.

iv. Isogene Clones

[0216] Partner(s) will have access to the physical clones that correspond to each of the haplotypes for a given gene
or other locus. These isogene clones can be used in primary or secondary screening assays and will provide useful
information on such pharmacological properties as drug binding, promoter strength, and functionality.

v. Gene Selection by Partners

[0217] The partners can select genes (or other loci) of their choosing for haplotyping in the index repository. The
genes selected can be in the public domain or proprietary to the partner(s). In a preferred embodiment, haplotyping
results for a proprietary gene will only be accessible by the owner of that gene until sequence information for the gene
enters the public domain.

vi. Patent Dossier

[0218] In a preferred embodiment, the Isogenomics™ Database also contains public patent information that is available
for each gene in the database. This feature provides the partner(s) with an understanding of the potential proprietary
status of any gene in the database.


vii. Committed Liaison 

[0219] In a preferred embodiment, the HAP™ Company will assign a Ph.D. level scientist as a liaison to a partner
to facilitate communication, technology transfer, and informatics support.

viii. Special Services: cDNAs and Genomic Intervals 

[0220] In a preferred embodiment, the HAP™ Company will also provide, at an extra charge, special molecular
biological and genomics services to partner(s) who submit cDNAs or ESTs to be haplotyped. cDNAs or ESTs will be
utilized to retrieve genomic loci and to create special haplotyping assays that will allow the gene locus at the chromosome
level to be haplotyped in the index repository. Genomic intervals containing possible genes of high significance
for phenotypic correlations stemming from positional cloning programs can also be submitted by partner(s) for haplotyping.

b. Membership in the Partnership 

[0221] Each partner will pay the HAP™ Company a fee for membership in the Partnership, preferably for a period
of at least two or three years. Companies joining the Partnership may utilize the resources of the informatics computer
program and Isogenomics™ Database on a company-wide basis, including groups in drug discovery, medicinal chemistry,
clinical development, regulatory affairs, and marketing.

c. Envisioned Outcomes From The Partnership 


[0222] It is contemplated that novel isogenes will be isolated and characterized by the HAP™ Company, as well as
methods for the detection of novel SNPs or haplotypes encompassed by the isogenes.

[0223] It is also contemplated that associations between clinical outcome and haplotypes (hereinafter "haplotype
association") for many of the genes in the Isogenomics™ Database will be discovered. Therefore, it is also contemplated
that methods of using the haplotypes and/or isogenes for diagnostic or clinical purposes relating to disease
indications supported by the particular association will be discovered.

[0224] It is further contemplated that there will be successful applications of the data and informatics tools for drug
approval and marketing.

[0225] A number of different scenarios for using the database and/or analytical tools of the present invention may
be envisioned. These include the following:

1. A Partner selects, from the HAP™ Company's database, a candidate gene or genes that have been haplotyped. The
Partner provides clinical cohorts for haplotype analysis and provides clinical response data for the cohorts. The
HAP™ Company performs haplotype analysis for the candidate gene(s) in the clinical cohorts, finds new haplotypes,
if any, and determines the association between one or more haplotypes and clinical response using the
informatics computer program.

2. The Partner selects, from the HAP™ Company's database, a candidate gene that has been haplotyped. The Partner
provides clinical cohorts for haplotype analysis. The HAP™ Company does haplotype analysis, finds new haplotypes,
if any, and sends the haplotype data to the Partner. The Partner determines the association between haplotype
and clinical response using the informatics computer program provided by the HAP™ Company.

3. Like 1 above, but the Partner performs the haplotype analysis and determines the association between haplotype
and clinical response.

4. Like 2 above, but the Partner performs the haplotype analysis.

5. A Partner provides one or more genes to the HAP™ Company for haplotype analysis. The HAP™ Company
clones and characterizes isogenes for the gene(s), discovers new polymorphisms in the gene(s), if any, and determines
the haplotypes for the gene(s).

6. Based on polymorphisms observed in a gene or genes, a Partner sends the HAP™ Company clinical cohorts
to haplotype, and the Partner uses the haplotype data in conjunction with its own clinical response data to determine
the association between haplotype and clinical response.

7. A Partner sends the HAP™ Company a cDNA or an expressed sequence tag (EST). The HAP™ Company
isolates and characterizes the gene corresponding to the cDNA or EST. The HAP™ Company clones isogenes of
the gene and determines the haplotypes embodied within the isogenes.

[0226] A more detailed description of how the database and/or analytical tools of the present invention may be used
in the context of clinical trials is set forth below.

[0227] As a review, the standard routine procedure in premarketing development of a new drug to be used in humans
is to conduct pre-clinical animal toxicology studies in two or more species of animals followed by three phases of clinical
investigation as follows: Phase I: clinical pharmacology investigations with attention to pharmacokinetics, metabolism,
and both single dose and dose-range safety; Phase II: limited size, closely monitored investigations designed to assess
efficacy and relative safety; Phase III: full scale clinical investigations designed to provide an assessment of safety,
efficacy, optimum dose and more precise definition of drug-related adverse effects in a given disease or condition. In
other words, Phase I and Phase II are the early stages of the drug's development, when the safety and the dosing
level are tested in a small number of patients. Once the safety and some evidence that the drug is effective in treatment
have been established, the drug's developer then proceeds to Phase III. In Phase III, many more patients, usually
several hundred, are given the new drug to see whether the early findings that demonstrated safety and effectiveness
will be borne out in a larger number of patients. Phase III is pivotal to learning hard statistical facts about a new drug.
Larger numbers of patients reveal the percentage of patients in which the drug is effective, as well as give doctors a
clearer understanding about the side effects which may occur.

[0228] In the research or discovery phase, a Partner's discovery personnel may desire haplotype information for
isogenes of a gene, and/or one or more clones containing isogenes of the gene, regardless of whether or not clinical
trials (or field trials, in the case of plants) are planned, in progress, or completed. For example, the Partner may be
studying a gene (or its encoded protein) and may be interested in obtaining information concerning, e.g., protein structure
or mRNA structure, in particular information concerning the location of polymorphisms in the mRNA structure and their
possible effect on mRNA transcription, translation or processing, as well as their possible effect on the structure and
function of the encoded protein. Such information may be useful in designing and/or interpreting the results of laboratory
tests, such as in vitro or animal tests. Such information may be useful in correlating polymorphisms with
a particular result or phenotype which may indicate that the gene is likely to be responsible for certain diseases, drug
response or other trait. Such information could aid in drug design for pharmaceutical use in humans and animals, or
aid in selecting or augmenting plants or animals for desired traits such as increased disease or pest resistance, or
increased fertility for agricultural or veterinary use. The Partner may also be interested in knowing the frequency of
the haplotypes. Such information may be used by the Partner to determine which haplotypes are present in the population
below a certain frequency, e.g., less than 5%, and the Partner may use this information to exclude studying the
isogenes, mRNAs and encoded proteins for these haplotypes and may also use this information to weed out individuals
containing these haplotypes from their proposed clinical trials.
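
By way of non-limiting illustration, the frequency screen described above might be sketched in Python as follows. The function names, the haplotype labels (H1, H2, H3) and the sample cohort are hypothetical and chosen only for this sketch; they are not part of the informatics program described herein.

```python
from collections import Counter

def haplotype_frequencies(pairs):
    """Return {haplotype: frequency} from a list of (hap1, hap2) pairs."""
    counts = Counter(h for pair in pairs for h in pair)
    total = sum(counts.values())  # two chromosomes per individual
    return {h: n / total for h, n in counts.items()}

def rare_haplotypes(pairs, threshold=0.05):
    """Return the sorted haplotypes whose frequency is below the threshold."""
    freqs = haplotype_frequencies(pairs)
    return sorted(h for h, f in freqs.items() if f < threshold)

# Hypothetical cohort: 25 individuals, i.e. 50 chromosomes.
cohort = [("H1", "H1")] * 15 + [("H1", "H2")] * 8 + [("H2", "H3")] * 2
print(rare_haplotypes(cohort))
```

Individuals carrying any haplotype returned by `rare_haplotypes` could then be flagged for exclusion, under the assumption that the cohort is representative of the population of interest.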

[0229] When information such as that described above is desired by a Partner, then the HAP™ Company may give
the Partner access to all or part of the data and/or analytical tools exemplified herein by the DecoGen™ Informatics
Platform. The Partner may also be given access to one or more clones containing isogenes, e.g., a genome anthology
clone (see, e.g., US Patent Application Ser. No. 60/032,645, filed December 10, 1996 and US Patent Application Ser.
No. 08/987,966, filed December 10, 1997).

[0230] During a Phase I clinical trial, which is being conducted to determine the safety of a drug (or drugs) in people,
a Partner may desire haplotype information for haplotypes of a gene, and/or one or more clones containing isogenes
of the gene, in particular when toxicity or adverse reactions to the drug are observed in at least some of the people
taking the drug. In that case, the Partner may request that the HAP™ Company obtain, for each person experiencing
toxicity or other adverse effect, the haplotypes for one or more genes which are suspected to be associated with the
observed toxicity or adverse effect (e.g., a gene or genes associated with liver failure) and determine whether there is
a correlation between haplotype and the observed toxicity or adverse effect. If there is a correlation, then the Partner
may decide to keep all people having the haplotype correlated with toxicity or other adverse effect out of Phase II
clinical trials, or to allow such people to enter Phase II clinical trials, but be monitored more closely and/or given conjunctive
therapy to modify the toxicity or other adverse effect. The HAP™ Company may provide a diagnostic test, or
have such a test prepared, which will detect the people who have, or lack, the haplotype correlated with toxicity or
other adverse effect.

[0231] During a Phase II clinical trial, which is being conducted to determine the efficacy of a drug (or drugs) in
people, a Partner may desire haplotype information for haplotypes of a gene, and/or one or more clones containing
isogenes of the gene, in particular when the results of the trial are ambiguous. For example, the results of a Phase II
clinical trial might indicate that 50% of the people given a drug were responders (e.g., they lost weight in a trial for an
anti-obesity drug, albeit to different degrees), 49.9% of people were non-responders (e.g., they did not lose any weight)
and 0.1% had adverse effects. In such a case, the Partner may, for example, request that the HAP™ Company obtain,
for each person in the Phase II clinical trial, the haplotypes for one or more genes which are suspected to be associated
with the drug response. (In general, such gene(s) will be different from the gene associated with the adverse
effect, but not necessarily.) A correlation may then be obtained between various haplotypes and the observed level of
response to the drug. If a correlation is found, this information may be used to determine those individuals in which
the drug will or will not be effective and, therefore, identify who should or should not get the drug. In addition, the
information may also be used to develop a model (or test) which will predict, as a function of haplotype, how much of
the drug should be used in an individual patient to get the desired result. Again, the HAP™ Company may provide a
diagnostic test, or have such a test prepared, which will detect the people who have, or lack, the haplotype correlated
with the efficacy or non-efficacy of the drug.

[0232] During Phase III clinical trials, which are being conducted to verify the safety and efficacy of a drug (or drugs)
in people, a Partner may desire haplotype information for isogenes of a gene, and/or one or more clones containing
isogenes of the gene, in particular to use at the beginning of the trial to design cohorts of patients (i.e., groups of
individuals who will be treated the same). For example, the drug or placebo can be given to a group of people who
have the same haplotype which is expected to be correlated with a good drug response, and the drug or placebo can
be given to a group of people who have the same haplotype which is expected to be correlated with no drug response.
The results of the trial will confirm whether or not the expected correlation between haplotype and drug response is
correct.

[0233] During "Phase IV," which involves monitoring of clinical results after FDA approval of a drug to obtain additional
data concerning the safety and efficacy of a drug (or drugs) in people, a Partner may desire haplotype information for
a gene, and/or one or more clones containing isogenes of the gene, in particular if additional adverse events (or hidden
side effects) become apparent. In such a case, the methods described above can be used to identify people who are
likely to experience such adverse events.

[0234] After clinical trials are successfully completed, a Partner may desire haplotype information for isogenes of a
gene, and/or one or more isogene clones, in particular in the situation where the drug is what is known as a "me too"
drug, i.e., there are already a number of drugs on the market used to treat the disease or other condition which the
Partner's drug is designed to treat. This can be used, e.g., as a marketing or business development tool for the Partner
and/or help health care providers, such as doctors and HMOs, to keep drug costs down. For example, the haplotype
information and analytical tools of the invention may be used to identify the patients for whom the Partner's drug will
work and/or for whom the Partner's drug will be superior to (or cheaper than) the other drugs on the market. A test can
be developed to identify the target patients. This test can be diagnostic for the condition (e.g., it could distinguish
asthma from a respiratory infection) or it could be diagnostic for response to the drug. Preferably the doctor can perform
the test in his office or other clinical setting and be able to prescribe the appropriate drug immediately, or after access
to part or all of the database or analytical tools of the invention. This will also aid the doctor in that it may provide
information about which drugs not to give, since they will not be effective in the patient. Again, this reduces costs for
the patient and/or health care provider, and will likely accelerate the time in which the patient will receive effective
treatment, since time may be saved by eliminating trial and error administrations of other drugs which would not be
expected to work for the disease or condition manifested by the patient.

[0235] If clinical trials are unsuccessfully completed, a Partner may desire haplotype information for isogenes, and/or
one or more isogene clones containing isogenes of the gene, to correlate drug response with haplotype and to use
as an aid in designing an additional clinical trial (or trials), as discussed elsewhere herein.

[0236] The database and analytical tools of the invention are envisioned to be useful in a variety of settings, including
various research settings, pharmaceutical companies, hospitals, independent or commercial establishments. It is expected
that users will include physicians (e.g., for diagnosing a particular disease or prescribing a particular drug), pharmaceutical
companies, generics companies, diagnostics companies, contract research organizations and managed care
groups, including HMOs, and even patients themselves.

[0237] However, as discussed above, it is obvious that various aspects of the invention may be useful in other settings,
such as in the agricultural and veterinary venues.

[0238] The following examples illustrate certain embodiments of the present invention, but should not be construed
as limiting its scope in any way. Certain modifications and variations will be apparent to those skilled in the art from
the teachings of the foregoing disclosure and the following examples, and these are intended to be encompassed by
the spirit and scope of the invention.

2. Mednostics Program 


[0239] The Mednostics™ program is a program in which one company, i.e., the HAP™ Company, uses HAP™ Technology
to analyze variation in response to drugs currently marketed by third parties, in the hope of conferring a competitive
advantage on these companies. It is expected that this technology will provide pharmaceutical companies with
information that could lead to the development of new indications for existing drugs, as well as second generation
drugs designed to replace existing drugs nearing the end of their patent life. As a result, the Mednostics™ program will
benefit pharmaceutical companies by allowing them to extend the patent life of existing drugs, revitalize drugs facing
competition and expand their existing market. Entities such as HMOs and other third-party payers, as well as pharmacy
benefit management organizations, may also benefit from the Mednostics™ program.

[0240] The goals of the Mednostics™ program are to find HAP™ Markers that:


• identify individuals who are currently not undergoing therapy for a given disease yet are at risk and will respond
well to a given drug. This application would be useful in markets that have high growth potential and involve conditions
that are undertreated, such as many central nervous system disorders and cardiovascular disease; and

• identify individuals who will respond better to one drug within a competitive class than other drugs in the same
class or to one competing class of drugs as compared to another class of drugs. This application would allow drugs
that are not selling well to gain a greater market share and would be best applied to a drug that was not the first
introduced into the market and is having difficulty gaining market share against the established competitors. Alternatively,
if multiple drug classes are indicated for the same disease, they could be differentiated by HAP™ Markers,
thus giving drugs within one class a competitive advantage over the other class.

[0241] An example of the Mednostics™ program involves the statin class of drugs, which are used to treat patients
with high cholesterol and lipid levels and who are therefore at risk for cardiovascular disease. This is a highly competitive
market with multiple approved products seeking to gain increased market share. For example, three of the most commonly
prescribed statins are pravastatin (sold by Bristol-Myers Squibb Company as Pravachol), atorvastatin (sold by
Parke-Davis as Lipitor), and cerivastatin (sold by Bayer AG as Baycol). The statin market is currently approximately
$1 billion worldwide and is forecasted to at least double in size by 2005. Identification of genetic markers that would
allow the right drug to reach the right patient would allow a company to boost its market share and improve patient
compliance, which are both particularly important factors when maximizing profit from drugs that are taken over the
course of a lifetime.

H. EXAMPLE 1 

SIMULATED CLINICAL TRIAL

[0242] For illustration, we will use a particular example that shows how the CTS™ method works, and how the
DecoGen™ application is used. For this we have simulated a data set. Polymorphisms for the gene CYP2D6 were
obtained from the literature. From those we constructed 10 haplotypes. A set of individual subjects was created and
assigned a value of the variable "Test" in the range from 0.0-1.0. They were also assigned 2 of the haplotypes. This
data set simulates what would come from a clinical trial in which patients were haplotyped and tested for some clinical
variable. Most individuals have a relatively low value of the Test measure, but a small number have a large value. This
simulates the case where a small number of individuals taking a medication have an adverse reaction. Our goal is to
find genetic markers (i.e., haplotypes) that are correlated with this adverse event.

[0243] Step 1. Identify candidate genes. CYP2D6 is the sample candidate gene.

[0244] Step 2. Define a Reference Population. A standard population is used. An example is the CEPH families and
unrelated individuals whose cell lines are commercially available. (Source: Coriell Cell Repositories, URL:
http://locus.umdnj.edu/nigms/ceph/ceph.html) Coriell sells cell lines from the CEPH families (a standard set of families from the
United States and France for which cell lines are available for multiple members from several generations from several
families) and from individuals from other ethnogeographic groups. The CEPH families have been widely studied. The
cell lines were originally collected by Foundation Jean DAUSSET (http://landru.cephb.fr/).
[0245] Step 3. DNA from this reference population is obtained. 

[0246] Step 4. Haplotype individuals in the reference population. We use either direct or indirect haplotyping methods,
or a combination of both, to obtain haplotypes for the CYP2D6 gene in the reference population. The polymorphic sites
and nucleotide positions for these individuals are given in FIGURES 4A and 4B.

[0247] Step 5. Get population averages and other statistics. The haplotypes and population distributions are shown
using the DecoGen™ application in FIGURES 4A, 4B, 10, and 11. They are determined by the methods and equations
described in Item 5 above.

[0248] Step 6. Determine genotyping markers. By examining the linkage data (FIGURE 15) we see that all of the
sites are tightly linked except 2 and 8. This indicates that this set should be a minimal set for genotyping. From this it
was decided to genotype patients in the clinical trial at only these sites.

[0249] Step 7. Recruit a trial population. In this case we use the reference population as the clinical population,
having only added the simulated values of Test.

[0250] Step 8. Treat, test and haplotype patients. All patients are measured for the Test variable. All of the patients
are then genotyped at sites 2 and 8 (i.e., unphased haplotypes are found at these sites). Next their haplotypes are
found directly (for those individuals who are totally homozygous or heterozygous at only one site) or inferred using
maximum likelihood methods based on the observed haplotype frequencies in the reference population.
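
As a simplified, non-limiting sketch of Step 8, the Python code below enumerates the haplotype pairs consistent with an unphased genotype and, when more than one pair is possible, selects the pair with the highest likelihood given reference haplotype frequencies. The function names and the frequency values are hypothetical, random mating is assumed, and choosing a single most likely pair is a simplification that stands in for the maximum likelihood methods referred to above.

```python
from itertools import product

def candidate_pairs(genotype):
    """Enumerate haplotype pairs consistent with an unphased genotype.

    genotype is a list of two-allele tuples, one per site,
    e.g. [("T", "C"), ("A", "G")] for sites 2 and 8.
    """
    pairs = set()
    # At each site, choose which allele goes on the first chromosome.
    for choice in product(*[(0, 1)] * len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def most_likely_pair(genotype, ref_freq):
    """Pick the consistent pair with the highest likelihood (random mating assumed)."""
    def likelihood(pair):
        h1, h2 = pair
        p = ref_freq.get(h1, 0.0) * ref_freq.get(h2, 0.0)
        return p if h1 == h2 else 2 * p
    return max(candidate_pairs(genotype), key=likelihood)

# Hypothetical reference-population frequencies for sites 2 and 8.
ref_freq = {"TA": 0.1, "CG": 0.6, "TG": 0.2, "CA": 0.1}

# Heterozygous at one site only: the phase is determined directly.
print(candidate_pairs([("T", "T"), ("A", "G")]))
# Heterozygous at both sites: two pairs are possible, so frequencies decide.
print(most_likely_pair([("T", "C"), ("A", "G")], ref_freq))
```

A genotype for which `candidate_pairs` returns a single pair corresponds to the "found directly" case; all other genotypes require inference.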

[0251] Step 9. Find correlations between haplotype pair and clinical outcome. We measure the value of Test.

[0252] First we examine the results of the single site regression model (FIGURE 21) to determine the sites showing
the strongest correlation with Test. From this we see that sites 2 and 8 have a strong correlation, at the 99% confidence
level.

[0253] The statistics for each of the sub-haplotype pair groups (using sites 2 and 8) are shown in FIGURES 18, 19,
and 22. From this we see that individuals homozygous for TA at sites 2 and 8 have a high value of Test (average of
0.93). One conclusion we can make from this data is that patients homozygous for TA are likely to have an adverse
reaction. A typical haplotype pair distribution is shown in detail in FIGURE 20.

[0254] We can use the ANOVA calculation to see whether grouping individuals by haplotype-pair (or sub-haplotype-pair)
helps explain the observed variation in response in a statistically significant way. If ANOVA indicates that there
is a significant group-to-group variation, then we can investigate this correlation further using the regression and clinical
modeling tools. From FIGURE 23, we see that there is a significant level of group-to-group variation even at the 99%
confidence level. This says that the haplotype-pair (or sub-haplotype-pair) that an individual has for this gene does
have a significant impact on that individual's value of Test.
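
For illustration only, the one-way ANOVA calculation referred to above might be sketched in Python as follows. The function name and the Test values are hypothetical and are not taken from FIGURE 23; the sketch computes only the F statistic, which would then be compared against the critical value of the F distribution for the chosen confidence level.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical Test values grouped by sub-haplotype pair at sites 2 and 8.
test_by_pair = {
    "TA/TA": [0.91, 0.95, 0.93],
    "TA/CG": [0.20, 0.25, 0.15],
    "CG/CG": [0.10, 0.18, 0.14],
}
f_stat, df_b, df_w = one_way_anova_f(list(test_by_pair.values()))
print(f_stat, df_b, df_w)
```

A large F value relative to the critical value indicates that grouping by haplotype pair explains a significant share of the variation in Test.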

[0255] Step 10. Follow-up trials are run. Additional trials should be run to accomplish 2 goals. The first would attempt
to prove the correlation between being homozygous for haplotype TA and the high value of Test. One way to do this
would be to enroll a group of subjects and break them into 4 cohorts. The first and second would be homozygous for
TC. The third and fourth would have no copies of TC. The first and third groups should take the medication causing
the high value of Test and the second and fourth should take a placebo. The cohorts and their expected response are
shown in the following matrix:



Cohort     Haplotype pair    Treatment     Expectation
Cohort 1   TC/TC             Medication    High value of Test
Cohort 2   TC/TC             Placebo       Low value of Test
Cohort 3   Not-TC/not-TC     Medication    Low value of Test
Cohort 4   Not-TC/not-TC     Placebo       Low value of Test



[0256] If we see this pattern of response, then the correlation between TC homozygosity and a high value of Test
is proven.

[0257] Step 11. Design a genotyping method to identify a relevant set of patients. Using the Genotype view tool in
the DecoGen™ browser, we found that by genotyping individuals at sites 2 and 8 we could classify the group with high
value of Test with 100% certainty. The results are shown in FIGURE 14.

I. EXAMPLE 2

1. Provision Of Clinical Data

[0258] DNA sequence information for a cohort of normal subjects was obtained and entered into the database as
described previously. For this example, 134 patients, all of whom came to the clinic having an asthmatic attack, were
recruited. Each patient had a standard spirometry workup upon entering the clinic, was given a standard dose of
albuterol, and was given a followup spirometry workup 30 minutes later. Blood was drawn from each patient, and DNA
was extracted from the blood sample for use in genotyping and haplotyping. Clinical data, in the form of the response
of the asthmatic patients to a single dose of nebulized albuterol, was obtained from the asthmatic patients, as described
previously (Yan, L., Galinsky, R.E., Bernstein, J.A., Liggett, S.B. & Weinshilboum, R.M. Pharmacogenetics, 2000, 10:
261-266). The clinical data was entered into the database, and displayed as in Fig. 29B.

2. Determination Of ADBR2 Genotypes And Haplotypes 

[0259] Haplotypes for ADBR2 were determined using a molecular genotyping protocol, followed by the computational
HAPBuilder procedure (see U.S. patent application serial No. 60/198,340 (inventors: Stephens, et al.), filed April 18,
2000). Comparison of the sequences resulted in the identification of thirteen polymorphic sites.

[0260] The ADBR2 gene was selected from the screen shown in Fig. 26. The polymorphism and haplotype data for
the ADBR2 gene among normal subjects was as displayed in Fig. 28. Only twelve different haplotypes were observed
and/or inferred. Diplotype and haplotype data for the ADBR2 gene among the asthmatic patients was as displayed in
Fig. 29A.

[0261] The heterozygosity of individual patients at each polymorphic site was as displayed in Fig. 30. At each polymorphic
site (SNP), each patient has zero, one, or two copies of a given nucleotide. The same is true of combinations
of SNPs: for any collection of two or more SNPs (i.e., a haplotype or sub-haplotype), a patient will have zero, one, or
two alleles having that particular combination of SNPs.

3. Correlation Of ADBR2 Haplotypes And Haplotype Pairs With Drug Response

[0262] The measure of delta %FEV1 pred. was chosen as the clinical outcome value for which correlations with 
ADBR2 haplotypes were to be sought. 


a. Build-Up Procedure (To 4 SNP Limit) 

[0263] Each individual SNP was statistically analyzed for the degree to which it correlated with "delta %FEV1 pred."
The analysis was a regression analysis, correlating the number of occurrences of the SNP in each subject's genome
(i.e., 0, 1, or 2) with the value of "delta %FEV1 pred."
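
As a minimal, non-limiting sketch of this regression, the Python code below counts the copies (0, 1, or 2) of a SNP or sub-haplotype pattern in each subject's haplotype pair and fits a least-squares line of response against copy number. The function names, the four-site haplotypes and the response values are hypothetical and chosen only for illustration; an asterisk in a pattern is assumed to match any nucleotide at that site.

```python
def copies(pattern, haplotype_pair):
    """Count how many of a subject's two haplotypes match the pattern (0, 1, or 2)."""
    return sum(
        all(p == "*" or p == h for p, h in zip(pattern, hap))
        for hap in haplotype_pair
    )

def linear_regression(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical subjects: (haplotype pair, clinical response value).
subjects = [
    (("GCGT", "GCGT"), 12.0), (("GCGT", "GCGT"), 10.0),
    (("GCAT", "GCGT"), 8.0), (("GCAT", "GCGT"), 6.0),
    (("GCAT", "GCAT"), 4.0), (("GCAT", "GCAT"), 2.0),
]
x = [copies("**A*", pair) for pair, _ in subjects]
y = [response for _, response in subjects]
slope, intercept, r = linear_regression(x, y)
print(x, slope, intercept, r)
```

A p-value for the fit can be derived from the statistic t = r·sqrt((n − 2)/(1 − r²)) with n − 2 degrees of freedom; that step is omitted here to keep the sketch free of external libraries.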

[0264] "Cut-off" criteria were applied to each SNP in turn, as follows. In this example, a confidence limit of 0.05 was
the default value for the tight cut-off, and a limit of 0.1 was the default value of the loose cut-off. The default values were
automatically entered into the screen shown in Fig. 39A, in the two boxes labeled "Confidence". A SNP was then
chosen from among the SNPs present in the population, and the p value calculated for correlation of this SNP with
delta %FEV1 pred. was tested against the tight cut-off. If the value was 0.05 or less, the SNP and associated correlation
data were stored for later calculations and for display in the screen shown in Fig. 39A. If the p value was between 0.05
and 0.1, the SNP and associated correlation data were stored without being displayed. Any SNP whose p value was
greater than 0.1 was discarded, i.e., it was not considered further in the process. All thirteen ADBR2 SNPs were
selected and tested in turn. The individual SNPs at positions 3 and 9 passed the tight cut-off; these were saved for
display in Fig. 39A. In addition, the SNP at position 11 passed the loose cut-off and was saved without display.
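
The two-level cut-off logic described above can be sketched, by way of example only, as a small Python function. The function name, the return labels and the p-values in the usage example are hypothetical; only the default thresholds of 0.05 and 0.1 come from the text.

```python
TIGHT_CUTOFF = 0.05  # default tight cut-off from the example
LOOSE_CUTOFF = 0.1   # default loose cut-off from the example

def classify(p_value, tight=TIGHT_CUTOFF, loose=LOOSE_CUTOFF):
    """Return what happens to a SNP or sub-haplotype with the given p value."""
    if p_value <= tight:
        return "store and display"
    if p_value <= loose:
        return "store without display"
    return "discard"

# Hypothetical p values keyed by SNP position.
p_values = {3: 0.01, 9: 0.03, 11: 0.08, 5: 0.40}
saved = {pos: classify(p) for pos, p in p_values.items() if classify(p) != "discard"}
print(saved)
```

Only the entries in `saved` would be carried forward to the combination steps that follow.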

[0265] All possible pair-wise combinations (sub-haplotypes) of the saved SNPs were then generated. The correlations
of the newly generated two-SNP sub-haplotypes with delta %FEV1 pred. were calculated by regression analysis,
as was done for the individual SNPs. The correlation of each sub-haplotype was tested in turn, as described above,
discarding any sub-haplotypes whose p-value did not pass the cut-off criteria and saving those that did pass, with those
that passed the tight cut-off stored for display in the screen shown in Fig. 39A. The sub-haplotypes that passed the
tight cut-off were ********A*G**, **A*****A****, and **A*******G**; these were saved for display in Fig. 39A. No sub-haplotypes
passed only the loose cut-off.

[0266] When all the two-SNP sub-haplotypes had been examined, all pair-wise combinations between the originally
saved SNPs and the saved two-SNP sub-haplotypes, and among the saved two-SNP sub-haplotypes, were generated.
This produced a collection of three-SNP and four-SNP sub-haplotypes. Again, correlations were calculated by regression.
A single three-SNP sub-haplotype, **a*****A*G**, passed the tight cut-off and was saved for display, and no
four-SNP sub-haplotype passed. No sub-haplotypes passed only the loose cut-off. Combinations between the saved
three-SNP sub-haplotypes and the saved SNPs generated four-SNP sub-haplotypes, none of which passed the tight
cut-off. No new combinations were possible within the default limit (four) to the number of SNPs permitted in the
generated sub-haplotypes. (See Fig. 39A, where "fixed site = 4" indicates the 4-SNP limit.)
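
The combination step of the build-up procedure can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used to produce Fig. 39A: sub-haplotypes are represented as 13-character strings with `*` for unspecified sites (as in the text), and the helper names `merge`, `n_sites`, and `build_up_step` are introduced here. The regression and cut-off test that would filter each new candidate are omitted.

```python
from itertools import combinations


def merge(a, b):
    """Merge two equal-length sub-haplotypes; return None if they conflict at a site."""
    out = []
    for x, y in zip(a, b):
        if x == "*":
            out.append(y)
        elif y == "*" or x == y:
            out.append(x)
        else:
            return None  # different alleles at the same site
    return "".join(out)


def n_sites(h):
    """Number of specified (unmasked) sites in a sub-haplotype."""
    return sum(c != "*" for c in h)


def build_up_step(saved, limit=4):
    """Return new candidates formed by pair-wise merging of saved entries."""
    new = set()
    for a, b in combinations(saved, 2):
        m = merge(a, b)
        if m is not None and n_sites(m) <= limit and m not in saved:
            new.add(m)
    return new


# Single SNPs at positions 3, 9 and 11 (alleles taken from **a*****A*G**).
snps = ["**a**********", "********A****", "**********G**"]
pairs = build_up_step(snps)                 # two-SNP candidates
triples = build_up_step(snps + sorted(pairs))  # candidates from SNPs and pairs
```

In a full run, each candidate would be tested by regression and kept only if its p-value passed a cut-off before the next round of merging.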

[0267] The results of the build-up process are shown in Fig. 39A, where the SNPs and sub-haplotypes that passed
the tight cut-off are displayed along with the results of the regression analyses. It was discovered that the three-SNP
sub-haplotype **a*****A*G** has a p-value nearly identical to that of the full haplotype. Figure 21b shows the regression
line (response as a function of number of copies of haplotype **A*****A*G**), indicating that the more copies of this
marker a patient has, the lower the response.

b. Pare-Down Procedure (To 10 SNP Limit) 

[0268] Each of the twelve haplotypes observed for the ADBR2 gene is analyzed for the degree to which it correlates
with the value of delta %FEV1 pred. by a regression analysis, correlating the number of occurrences of the haplotype
in the subject's genome, i.e., 0, 1, or 2, with the value of the clinical measurement.
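
The regression of a clinical value on copy number can be sketched as below. The data and the helper names `copies` and `fit_line` are hypothetical, and the sketch fits only the least-squares line; the p-value used for the cut-offs would come from a standard significance test on the slope, which is not shown here.

```python
def copies(genotype, sub):
    """Count how many of a subject's two haplotypes (0, 1, or 2) match `sub`.

    `sub` may contain "*" at sites that are ignored in the comparison.
    """
    def matches(hap):
        return all(s == "*" or s == c for s, c in zip(sub, hap))
    return sum(matches(hap) for hap in genotype)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical 4-site genotypes and clinical responses, for illustration only.
n_copies = copies(("ACGT", "ACTT"), "A*G*")
slope, intercept = fit_line([0, 1, 2, 0, 1, 2], [12, 8, 4, 12, 8, 4])
```

A negative slope in such a fit corresponds to the situation described above, where more copies of the marker go with a lower response.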

[0269] A "tight cut-off" criterion is then applied to each haplotype in turn. A first haplotype is selected, and its correlation
with delta %FEV1 pred. is tested against the tight cut-off of 0.05. If the p value is 0.05 or less, the haplotype and associated
correlation data are stored for later calculations and for display in the screen shown in Fig. 39A. If the p value is
between 0.05 and 0.1, the haplotype and associated correlation data are stored as well but are not displayed. Any
haplotype whose p value is greater than 0.1 is discarded, i.e., it is not considered further in the process. All twelve
ADBR2 haplotypes are selected and tested in turn.

[0270] From the saved haplotypes, all possible sub-haplotypes in which a single SNP is masked are generated by
systematically masking each SNP of all saved haplotypes. The correlations of the newly generated sub-haplotypes
with the clinical outcome value are calculated by regression, as was done for the haplotypes themselves. Each newly
generated sub-haplotype is tested against the tight and loose cut-offs as described above for the haplotype correlations,
discarding sub-haplotypes that do not pass the cut-off criteria and saving those that do pass.
[0271] When the first generation of sub-haplotypes, having a single SNP masked, has been tested, a second
generation of sub-haplotypes having two SNPs masked is generated from those of the first generation whose p-values
passed the cut-offs. This is done, as before, by systematically masking each of the remaining SNPs. The p-values of
the second generation of sub-haplotypes, having two SNPs masked, are tested, and from those that pass the cut-offs
a third generation having three SNPs masked is generated.
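
The generation-by-generation masking of the pare-down procedure can be sketched as follows. This is a minimal illustration, assuming haplotypes are strings with `*` marking a masked site; the names `mask_one_more` and `next_generation` are introduced here, and the regression and cut-off test between generations are left out.

```python
def mask_one_more(hap):
    """All sub-haplotypes obtained by masking one further unmasked site of `hap`."""
    return {hap[:i] + "*" + hap[i + 1:] for i, c in enumerate(hap) if c != "*"}


def next_generation(passed):
    """Union of the one-more-site-masked sub-haplotypes of every entry in `passed`."""
    out = set()
    for hap in passed:
        out |= mask_one_more(hap)
    return out


# Hypothetical 4-site haplotype, for illustration only.
first = mask_one_more("ACGT")      # one site masked
second = next_generation(first)    # two sites masked
```

In the procedure described above, only those members of `first` whose p-values pass the cut-offs would be handed to `next_generation`; here all of them are passed on to keep the sketch short.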

c. Cost Reduction 

[0272] The frequencies for each of the twelve haplotypes of the ADBR2 gene were calculated and were found to be
as shown in Fig. 28A (eleven of the twelve haplotypes are visible). A list of all 78 genotypes that could be derived from
the 12 observed haplotypes was generated. A portion of the list is shown in Fig. 32. The expected frequency of each
of these genotypes from the Hardy-Weinberg equilibrium was calculated, and is shown in the third column under each
population group. Linkage between the polymorphic sites was as shown in Fig. 33.
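
The count of 78 genotypes follows from 12 haplotypes taken as unordered pairs with repetition (12 × 13 / 2 = 78). The sketch below enumerates those pairs and assigns each its Hardy-Weinberg expectation (p² for a homozygous pair, 2pq for a heterozygous pair). The haplotype names and the uniform frequencies are placeholders, not the values of Fig. 28A, and `hardy_weinberg` is a name introduced here.

```python
from itertools import combinations_with_replacement


def hardy_weinberg(freqs):
    """Map each unordered haplotype pair to its Hardy-Weinberg expected frequency.

    `freqs` maps haplotype name -> population frequency (summing to 1).
    """
    expected = {}
    items = sorted(freqs.items())
    for (a, pa), (b, pb) in combinations_with_replacement(items, 2):
        expected[(a, b)] = pa * pa if a == b else 2 * pa * pb
    return expected


# Twelve placeholder haplotypes with equal frequencies (illustrative only).
freqs = {f"hap{k:02d}": 1 / 12 for k in range(1, 13)}
genotypes = hardy_weinberg(freqs)
```

Sorting the resulting dictionary by value would give a genotype list ordered by Hardy-Weinberg frequency, as used in the mask comparison that follows.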

[0273] A set of masks of the same length as the haplotype, i.e., thirteen sites in length, was created. A portion of the
set of masks is shown in Fig. 34, along with a portion of the list of possible genotypes (haplotype pairs) which has been
sorted by Hardy-Weinberg frequency.

[0274] For each mask, an ambiguity score was calculated as follows: all pairs of genotypes [i,j] that were rendered
identical by imposition of the mask were noted, and the geometric mean of their Hardy-Weinberg frequencies ($f_i$ and
$f_j$) was calculated. For each mask, all the geometric means of the frequencies of all the ambiguous pairs were added
together, and the sum was multiplied by 10 to obtain the ambiguity score for that mask:

$$\text{ambiguity score} = 10 \sum_{[i,j]} \sqrt{f_i f_j}$$


[0275] Ambiguity scores calculated in this manner are shown in Fig. 34 to the right of each of the displayed masks,
along with the genotype pairs rendered ambiguous by the mask. (The genotype numbers refer to the row numbers in
the first column of the sorted genotype list.)
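
The ambiguity score can be sketched in a few lines. This is an illustration under assumptions: a genotype is taken to be an unordered pair of haplotype strings, a mask is a set of 0-based site indices, and two genotypes count as ambiguous when their masked haplotype pairs coincide. The names `apply_mask` and `ambiguity_score` and the example data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def apply_mask(hap, masked):
    """Replace the sites listed in `masked` (0-based indices) with "*"."""
    return "".join("*" if i in masked else c for i, c in enumerate(hap))


def ambiguity_score(genotypes, masked):
    """10 times the sum of sqrt(f_i * f_j) over genotype pairs made identical by the mask.

    `genotypes` maps (hap1, hap2) -> Hardy-Weinberg frequency.
    """
    def view(genotype):
        # Masked, order-independent representation of a haplotype pair.
        return tuple(sorted(apply_mask(h, masked) for h in genotype))

    total = 0.0
    for (g1, f1), (g2, f2) in combinations(genotypes.items(), 2):
        if view(g1) == view(g2):
            total += sqrt(f1 * f2)
    return 10 * total


# Hypothetical two-site genotypes and frequencies, for illustration only.
genos = {("AC", "AC"): 0.25, ("AC", "AT"): 0.25, ("GC", "GC"): 0.5}
score_masked = ambiguity_score(genos, {1})   # mask the second site
score_full = ambiguity_score(genos, set())   # no sites masked
```

Because the geometric mean is small whenever one member of an ambiguous pair is rare, ambiguities between a common and a rare genotype add little to the score, which is consistent with the low score reported for the mask discussed below.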

[0276] From the data visible in Fig. 34, it may be seen that one can mask sites 1, 6, 7, 8, and 10 (five of the thirteen
polymorphic sites in the ADBR2 gene) with an ambiguity score of only 0.072. This mask (sixteenth mask from the top)
renders four genotypes (sets of haplotype pairs) ambiguous, and three of the four ambiguities are between common
and rare haplotype pairs. It is thus discovered that a savings of about 38% in the variable cost of haplotyping this gene
can be achieved, simply by measuring eight rather than all thirteen known polymorphic sites, and that the complete
haplotype can be inferred with high confidence from this smaller data set.
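
As a quick check of the savings figure, assuming the variable cost scales linearly with the number of sites measured (an assumption made here for the arithmetic, not a statement from the source):

$$\text{savings} = \frac{13 - 8}{13} = \frac{5}{13} \approx 0.385$$

which agrees with the figure of about 38% quoted above.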
[0278] All references cited in this specification, including patents and patent applications, are hereby incorporated
in their entirety by reference. The discussion of references herein is intended merely to summarize the assertions made
by their authors and no admission is made that any reference constitutes prior art. Applicants reserve the right to
challenge the accuracy and pertinency of the cited references.

[0279] Modifications of the above described modes for carrying out the invention that are obvious to those of skill in
the fields of chemistry, medicine, computer science and related fields are intended to be within the scope of the following
claims.

1. (Old claim 59) A method of determining polymorphic sites or sub-haplotypes that correlate with a clinical response
or outcome of interest, comprising:

(a) providing haplotype information, and clinical response or outcome data (clinical outcome values) from a
cohort of subjects;

(b) statistically analyzing each individual SNP in the haplotype for the degree to which it correlates with the
clinical outcome values, and generating a numerical measure of the degree of correlation;

(c) saving for further processing those individual SNPs whose numerical measure of the degree of correlation 
with the clinical outcome values exceeds a first cut-off value; 

(d) generating all possible pair-wise combinations of the saved SNPs so as to provide a set of n-site
sub-haplotypes where n = 2;

(e) statistically analyzing each newly generated n-site sub-haplotype for the degree to which it correlates with
the clinical outcome values and calculating a numerical measure of the degree of correlation;

(f) saving for further processing those n-site sub-haplotypes whose numerical measure of the degree of
correlation with the clinical outcome values exceeds the first cut-off value;

(g) generating all possible pair-wise combinations among and between the saved SNPs and saved
sub-haplotypes, to produce new subhaplotypes with increased values of n;

(h) repeating steps (e) through (g) until either (i) no new sub-haplotypes can be generated, or (ii) no further
sub-haplotypes having n less than a pre-selected limit can be generated.

2. The method of claim 1, further comprising the step of displaying those saved SNPs and sub-haplotypes whose
numerical measure of the degree of correlation with the clinical outcome value exceeds a second cut-off value,
wherein the second cut-off value is greater than the first cut-off value.

3. The method of claim 1, wherein the numerical measure of degree of correlation is replaced by the p-value for the
correlation, and SNPs and sub-haplotypes are saved if the p-value is less than a first cut-off value.

4. The method of claim 3, further comprising the step of displaying those saved SNPs and sub-haplotypes whose
p-value for the correlation with the clinical outcome value is less than a second cut-off value, wherein the second
cut-off value is less than the first selected value.

5. The method of any one of claims 1-4, further comprising the step of excluding from further processing complex
subhaplotypes which are constructed from smaller subhaplotypes, where the smaller sub-haplotypes each have
correlation values that are at least as significant as that of the complex sub-haplotype.

6. A method of determining polymorphic sites or sub-haplotypes that correlate with a clinical response or outcome
of interest, comprising:

(a) providing single gene haplotype information for one or more genes, and clinical response or outcome data,
from a cohort of subjects;

(b) statistically analyzing each single gene haplotype for the degree to which it correlates with the clinical 
response or outcome of interest, and calculating a numerical measure of the degree of correlation; 

(c) saving for further processing those haplotypes whose numerical measure of the degree of correlation with 
the clinical response or outcome of interest exceeds a first selected value; 

(d) for each haplotype composed of m polymorphic sites, generating all possible sub-haplotypes having a
single site masked, so as to provide a set of sub-haplotypes having (m-n) sites, where n = 1;

(e) statistically analyzing each newly generated sub-haplotype for the degree to which it correlates with the 
clinical response or outcome of interest, and calculating a numerical measure of the degree of correlation; 

(f) saving for further processing those sub-haplotypes whose numerical measure of the degree of correlation 
with the clinical response or outcome of interest exceeds the first selected value; 

(g) from the saved sub-haplotypes, generating all possible subhaplotypes having one additional site masked; 

(h) repeating steps (e) through (g) until either (i) no new sub-haplotypes have a degree of correlation which
exceeds the first selected value, or (ii) no further sub-haplotypes having more unmasked sites than a
pre-selected limit can be generated.

7. The method of claim 6, further comprising the step of displaying those saved sub-haplotypes whose numerical
measure of the degree of correlation with the clinical response or outcome of interest exceeds a second selected
value, wherein the second selected value is greater than the first selected value.

8. The method of claim 6, wherein the numerical measure of degree of correlation is replaced by the p-value for the
correlation, and sub-haplotypes are saved if the p-value is less than a first selected value.

9. The method of claim 8, further comprising the step of displaying those saved sub-haplotypes whose p-value for
the correlation with the clinical response or outcome of interest is less than a second selected value, wherein the
second selected value is less than the first selected value.

10. The method of any one of claims 6-9, further comprising the step of excluding from further processing complex
subhaplotypes which are constructed from smaller subhaplotypes, where each of the smaller sub-haplotypes has
correlation values that are at least as significant as that of the complex sub-haplotype.

11. (Old claim 110) A computer-usable medium having computer-readable program code stored thereon, for causing
a computer to determine polymorphic sites or sub-haplotypes that correlate with a clinical response or outcome
of interest, or other phenotype, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to access a database containing haplotype
information, and clinical response or outcome data (clinical outcome values) or other phenotype data, from a
cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each individual SNP in
the haplotype for the degree to which it correlates with the clinical outcome values or other phenotype data,
and generating a numerical measure of the degree of correlation;

(c) computer-readable program code for causing a computer to store for further processing those individual 
SNPs whose numerical measure of the degree of correlation with the clinical outcome values or other pheno- 
type data exceeds a first cut-off value; 

(d) computer-readable program code for causing a computer to generate all possible pair-wise combinations 
of the saved SNPs so as to provide a set of n-site sub-haplotypes where n = 2; 

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
n-site sub-haplotype for the degree to which it correlates with the clinical outcome values or other phenotype
data, and calculate a numerical measure of the degree of correlation;

(f) computer-readable program code for causing a computer to store for further processing those n-site sub- 
haplotypes whose numerical measure of the degree of correlation exceeds the first cut-off value; 

(g) computer-readable program code for causing a computer to generate all possible pair-wise combinations
among and between the saved SNPs and saved sub-haplotypes, to produce new subhaplotypes with increased
values of n;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes can be generated, or (ii) no further sub-haplotypes having n less than a pre-selected or
user-selected limit can be generated.

12. The computer-usable medium of claim 11, which further comprises computer-readable program code stored
thereon for causing a computer to display those saved SNPs and sub-haplotypes whose numerical measure of the
degree of correlation with the clinical outcome value or other phenotype exceeds a second cut-off value, wherein
the second cut-off value is greater than the first cut-off value.

13. A computer-usable medium having computer-readable program code stored thereon, for causing a computer to
determine polymorphic sites or sub-haplotypes that correlate with a clinical response or outcome of interest, or
other phenotype, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to access a database containing haplotype
information, and clinical response or outcome data (clinical outcome values) or other phenotype data, from a
cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each individual SNP in
the haplotype for the degree to which it correlates with the clinical outcome values or other phenotype data,
and calculate the p-value for the degree of correlation;

(c) computer-readable program code for causing a computer to store for further processing those individual 
SNPs whose p-value for the degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to generate all possible pair-wise combinations
of the saved SNPs so as to provide a set of n-site sub-haplotypes where n = 2;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated n- 
site sub-haplotype for the degree to which it correlates with the clinical outcome values or other phenotype 
data, and calculate the p-value for the degree of correlation; 

(f) computer-readable program code for causing a computer to store for further processing those n-site
sub-haplotypes whose p-value for the degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to generate all possible pair-wise combinations
among and between the saved SNPs and saved sub-haplotypes, to produce new subhaplotypes with increased
values of n;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes can be generated, or (ii) no further sub-haplotypes having n less than a pre-selected or
user-selected limit can be generated.

14. The computer-usable medium of claim 11, which further comprises computer-readable program code stored
thereon for causing a computer to display those saved SNPs and sub-haplotypes whose p-value for the degree of
correlation with the clinical outcome value or other phenotype does not exceed a second cut-off value, wherein
the second cut-off value is less than the first cut-off value.

15. The computer-usable medium of claims 11-14, which further comprises computer-readable program code stored
thereon for causing a computer to exclude from further processing complex subhaplotypes which are constructed
from smaller subhaplotypes, where the smaller sub-haplotypes each have correlation values that are at least as
significant as that of the complex sub-haplotype.

16. A computer-usable medium having computer-readable program code stored thereon, for causing a computer to
determine polymorphic sites or sub-haplotypes that correlate with a clinical response or outcome of interest, or
other phenotype of interest, the computer-readable program code comprising:

(a) computer-readable program code for causing a computer to access a database containing single gene
haplotype information for one or more genes, and clinical response, outcome data, or other phenotype data
from a cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each single gene haplotype 
for the degree to which it correlates with the clinical response, outcome, or phenotype of interest, and to 
generate a numerical measure of the degree of correlation; 

(c) computer-readable program code for causing a computer to store for further processing those haplotypes 
whose numerical measure of the degree of correlation exceeds a first cut-off value; 

(d) computer-readable program code for causing a computer to generate, for each haplotype composed of m
polymorphic sites, all possible subhaplotypes having a single site masked, so as to provide a set of m-n site
sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
sub-haplotype for the degree to which it correlates with the clinical response, outcome, or phenotype of interest,
and calculating a numerical measure of the degree of correlation;

(f) computer-readable program code for causing a computer to save for further processing those
sub-haplotypes whose numerical measure of the degree of correlation exceeds the first cut-off value;

(g) computer-readable program code for causing a computer to generate, from the saved sub-haplotypes, all
possible sub-haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no 
new sub-haplotypes have a degree of correlation which exceeds the first cut-off value, or (ii) no further sub- 
haplotypes having more unmasked sites than a pre-selected limit can be generated. 

17. The computer-usable medium of claim 16, which further comprises computer-readable program code stored
thereon for causing a computer to display those saved subhaplotypes whose numerical measure of the degree of
correlation with the clinical response data, outcome value, or other phenotype data exceeds a second cut-off value,
wherein the second cut-off value is greater than the first cut-off value.

18. A computer-usable medium having computer-readable program code stored thereon, for causing a computer to 
determine polymorphic sites or sub-haplotypes that correlate with a clinical response or outcome of interest, or 
other phenotype of interest, the computer-readable program code comprising: 

(a) computer-readable program code for causing a computer to access a database containing single gene 
haplotype information for one or more genes, and clinical response, outcome data, or other phenotype data 
from a cohort of subjects; 

(b) computer-readable program code for causing a computer to statistically analyze each single gene haplotype 
for the degree to which it correlates with the clinical response, outcome, or phenotype of interest, and to 
calculate the p-value for the degree of correlation; 

(c) computer-readable program code for causing a computer to store for further processing those haplotypes 
whose p-value for the degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to generate, for each haplotype composed of m
polymorphic sites, all possible subhaplotypes having a single site masked, so as to provide a set of m-n site
sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
sub-haplotype for the degree to which it correlates with the clinical response, outcome, or phenotype of interest,
and calculating the p-value for the degree of correlation;

(f) computer-readable program code for causing a computer to save for further processing those
sub-haplotypes whose p-value for the degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to generate, from the saved sub-haplotypes, all 
possible sub-haplotypes having one additional site masked; 

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes have a p-value which does not exceed the first cut-off value, or (ii) no further subhaplotypes
having more unmasked sites than a pre-selected limit can be generated.

19. The computer-usable medium of claim 18, which further comprises computer-readable program code stored
thereon for causing a computer to display those saved subhaplotypes whose p-value for the degree of correlation with
the clinical response, outcome, or phenotype of interest does not exceed a second cut-off value, wherein the
second cut-off value is less than the first cut-off value.

20. The computer-usable medium of claims 16-19, which further comprises computer-readable program code stored
thereon for causing a computer to exclude from further processing complex sub-haplotypes which are constructed
from smaller subhaplotypes, where the smaller sub-haplotypes each have correlation values that are at least as
significant as that of the complex sub-haplotype.

21. (Old claim 161) A computer programmed to determine polymorphic sites or sub-haplotypes that correlate with a
clinical response or outcome of interest, or other phenotype, the computer comprising a memory having at least
one region for storing computer executable program code and a processor for executing the program code stored
in memory, wherein the program code includes:

(a) computer-readable program code for causing a computer to access a database containing haplotype
information, and clinical response or outcome data (clinical outcome values) or other phenotype data, from a
cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each individual SNP in 
the haplotype for the degree to which it correlates with the clinical outcome values or other phenotype data, 
and generating a numerical measure of the degree of correlation; 

(c) computer-readable program code for causing a computer to store for further processing those individual
SNPs whose numerical measure of the degree of correlation with the clinical outcome values or other
phenotype data exceeds a first cut-off value;

(d) computer-readable program code for causing a computer to generate all possible pair-wise combinations 
of the saved SNPs so as to provide a set of n-site sub-haplotypes where n = 2; 

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
n-site sub-haplotype for the degree to which it correlates with the clinical outcome values or other phenotype
data, and calculate a numerical measure of the degree of correlation;

(f) computer-readable program code for causing a computer to store for further processing those n-site sub- 
haplotypes whose numerical measure of the degree of correlation exceeds the first cut-off value; 

(g) computer-readable program code for causing a computer to generate all possible pair-wise combinations
among and between the saved SNPs and saved sub-haplotypes, to produce new subhaplotypes with increased
values of n;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes can be generated, or (ii) no further sub-haplotypes having n less than a pre-selected or
user-selected limit can be generated.

22. The computer of claim 21, wherein the program code further includes computer-readable program code for causing
a computer to display those saved SNPs and sub-haplotypes whose numerical measure of the degree of correlation
with the clinical outcome value or other phenotype exceeds a second cut-off value, wherein the second cut-off
value is greater than the first cut-off value.

23. A computer programmed to determine polymorphic sites or sub-haplotypes that correlate with a clinical response
or outcome of interest, or other phenotype, the computer comprising a memory having at least one region for
storing computer executable program code and a processor for executing the program code stored in memory,
wherein the program code includes:

(a) computer-readable program code for causing a computer to access a database containing haplotype
information, and clinical response or outcome data (clinical outcome values) or other phenotype data, from a
cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each individual SNP in 
the haplotype for the degree to which it correlates with the clinical outcome values or other phenotype data, 
and calculate the p-value for the degree of correlation; 

(c) computer-readable program code for causing a computer to store for further processing those individual 
SNPs whose p-value for the degree of correlation does not exceed a first cut-off value; 

(d) computer-readable program code for causing a computer to generate all possible pair-wise combinations
of the saved SNPs so as to provide a set of n-site sub-haplotypes where n = 2;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
n-site sub-haplotype for the degree to which it correlates with the clinical outcome values or other phenotype
data, and calculate the p-value for the degree of correlation;

(f) computer-readable program code for causing a computer to store for further processing those n-site
sub-haplotypes whose p-value for the degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to generate all possible pair-wise combinations
among and between the saved SNPs and saved sub-haplotypes, to produce new subhaplotypes with increased
values of n;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes can be generated, or (ii) no further sub-haplotypes having n less than a pre-selected or
user-selected limit can be generated.

24. The computer of claim 21, wherein the program code further includes computer-readable program code for causing
a computer to display those saved SNPs and sub-haplotypes whose p-value for the degree of correlation with the
clinical outcome value or other phenotype does not exceed a second cut-off value, wherein the second cut-off
value is less than the first cut-off value.

25. The computer of any one of claims 21-24, wherein the program code further includes computer-readable program
code for causing a computer to exclude from further processing complex subhaplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have correlation values that are at least as
significant as that of the complex sub-haplotype.

26. A computer programmed to determine polymorphic sites or sub-haplotypes that correlate with a clinical response
or outcome of interest, or other phenotype of interest, the computer comprising a memory having at least one
region for storing computer executable program code and a processor for executing the program code stored in
memory, wherein the program code includes:

(a) computer-readable program code for causing a computer to access a database containing single gene
haplotype information for one or more genes, and clinical response, outcome data, or other phenotype data
from a cohort of subjects;

(b) computer-readable program code for causing a computer to statistically analyze each single gene haplotype
for the degree to which it correlates with the clinical response, outcome, or phenotype of interest, and to
generate a numerical measure of the degree of correlation;

(c) computer-readable program code for causing a computer to store for further processing those haplotypes
whose numerical measure of the degree of correlation exceeds a first cut-off value;

(d) computer-readable program code for causing a computer to generate, for each haplotype composed of m
polymorphic sites, all possible subhaplotypes having a single site masked, so as to provide a set of m-n site
sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
sub-haplotype for the degree to which it correlates with the clinical response, outcome, or phenotype of interest,
and calculating a numerical measure of the degree of correlation;

(f) computer-readable program code for causing a computer to save for further processing those
sub-haplotypes whose numerical measure of the degree of correlation exceeds the first cut-off value;

(g) computer-readable program code for causing a computer to generate, from the saved sub-haplotypes, all
possible sub-haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes have a degree of correlation which exceeds the first cut-off value, or (ii) no further
sub-haplotypes having more unmasked sites than a pre-selected limit can be generated.

27. The computer of claim 26, wherein the program code further includes computer-readable program code for causing
a computer to display those saved sub-haplotypes whose numerical measure of the degree of correlation with the
clinical response data, outcome value, or other phenotype data exceeds a second cut-off value, wherein the second
cut-off value is greater than the first cut-off value.

28. A computer programmed to determine polymorphic sites or sub-haplotypes that correlate with a clinical response
or outcome of interest, or other phenotype of interest, the computer comprising a memory having at least one
region for storing computer executable program code and a processor for executing the program code stored in
memory, wherein the program code includes:

10 

(a) computer-readable program code for causing a computer to access a database containing single gene 
haplotype information for one or more genes, and clinical response, outcome data, or other phenotype data 
from a cohort of subjects: 

(b) computer-readable program code for causing a computer to statistically analyze each single gene haplotype
for the degree to which it correlates with the clinical response, outcome, or phenotype of interest, and to
calculate the p-value for the degree of correlation;

(c) computer-readable program code for causing a computer to store for further processing those haplotypes
whose p-value for the degree of correlation does not exceed a first cut-off value;

(d) computer-readable program code for causing a computer to generate, for each haplotype composed of m
polymorphic sites, all possible sub-haplotypes having a single site masked, so as to provide a set of m-n site
sub-haplotypes where n = 1;

(e) computer-readable program code for causing a computer to statistically analyze each newly generated
sub-haplotype for the degree to which it correlates with the clinical response, outcome, or phenotype of interest,
and calculating the p-value for the degree of correlation;

(f) computer-readable program code for causing a computer to save for further processing those sub-haplotypes
whose p-value for the degree of correlation does not exceed the first cut-off value;

(g) computer-readable program code for causing a computer to generate, from the saved sub-haplotypes, all
possible sub-haplotypes having one additional site masked;

(h) computer-readable program code for causing a computer to repeat steps (e) through (g) until either (i) no
new sub-haplotypes have a p-value which does not exceed the first cut-off value, or (ii) no further sub-haplotypes
having more unmasked sites than a pre-selected limit can be generated.
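
Steps (a) through (h) of claim 28 can be read as a level-by-level search in which each round masks one more site. The following is a minimal Python sketch of that reading, not the claimed implementation. It assumes one haplotype per subject, a binary responder flag, and a two-sided Fisher exact test as the source of the p-value; the claim does not name a particular statistical test. The names MASK, matches, fisher_p, p_value, mask_one_more and search_sub_haplotypes are hypothetical.

```python
from math import comb

MASK = "*"  # assumed symbol for a masked (ignored) polymorphic site


def matches(pattern, haplotype):
    """True if the haplotype agrees with the pattern at every unmasked site."""
    return all(p == MASK or p == h for p, h in zip(pattern, haplotype))


def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def p_value(pattern, cohort):
    """Steps (b) and (e): association between matching the pattern and responding."""
    table = [[0, 0], [0, 0]]
    for haplotype, responder in cohort:
        row = 0 if matches(pattern, haplotype) else 1
        col = 0 if responder else 1
        table[row][col] += 1
    return fisher_p(table[0][0], table[0][1], table[1][0], table[1][1])


def mask_one_more(pattern):
    """Steps (d) and (g): every pattern with exactly one additional site masked."""
    for i, allele in enumerate(pattern):
        if allele != MASK:
            yield pattern[:i] + (MASK,) + pattern[i + 1:]


def search_sub_haplotypes(cohort, first_cutoff, unmasked_limit=0):
    """Return {pattern: p-value} for every pattern that passes the first cut-off."""
    saved, frontier = {}, set()
    # Steps (b)-(c): test each full haplotype observed in the cohort.
    for hap in {h for h, _ in cohort}:
        p = p_value(hap, cohort)
        if p <= first_cutoff:
            saved[hap] = p
            frontier.add(hap)
    # Step (h): repeat (e)-(g) until nothing new passes or the limit is reached.
    while frontier:
        candidates = {
            sub
            for pattern in frontier
            for sub in mask_one_more(pattern)
            if sum(a != MASK for a in sub) > unmasked_limit
        }
        frontier = set()
        for sub in candidates:
            p = p_value(sub, cohort)
            if p <= first_cutoff:
                saved[sub] = p
                frontier.add(sub)
    return saved
```

Here cohort is a list of (haplotype, responder) pairs, with each haplotype given as a tuple of alleles. Only patterns that pass the first cut-off seed the next round, so the search stops as soon as a round saves nothing new.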

29. The computer of claim 28, wherein the program code further includes computer-readable program code for causing
a computer to display those saved sub-haplotypes whose p-value for the degree of correlation with the clinical
response, outcome, or phenotype of interest does not exceed a second cut-off value, wherein the second cut-off
value is less than the first cut-off value.
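
The display step of claim 29 amounts to a second, stricter filter over the saved sub-haplotypes. A minimal Python sketch follows, assuming the saved results are held as a mapping from pattern to p-value. The function name select_for_display and the ordering by ascending p-value are illustrative assumptions, not part of the claim.

```python
def select_for_display(saved, first_cutoff, second_cutoff):
    """Return (pattern, p-value) pairs passing the stricter second cut-off."""
    if not second_cutoff < first_cutoff:
        raise ValueError("second cut-off must be less than the first cut-off")
    hits = [(pattern, p) for pattern, p in saved.items() if p <= second_cutoff]
    # Assumption: most significant results are listed first.
    return sorted(hits, key=lambda item: item[1])
```

The looser first cut-off keeps weak candidates alive during the search, while the second cut-off limits what is reported.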

30. The computer of any one of claims 26-29, wherein the program code further includes computer-readable program
code for causing a computer to exclude from further processing complex sub-haplotypes which are constructed
from smaller sub-haplotypes, where the smaller sub-haplotypes each have correlation values that are at least as
significant as that of the complex sub-haplotype.
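
The exclusion rule of claim 30 can be read in more than one way. The Python sketch below takes one reading: a "smaller" sub-haplotype is a saved pattern that keeps a strict subset of the complex pattern's unmasked sites with the same alleles, and "at least as significant" means a p-value no larger than that of the complex pattern. The names MASK, unmasked_sites, is_component and prune_redundant are hypothetical.

```python
MASK = "*"  # assumed symbol for a masked site


def unmasked_sites(pattern):
    """Indices of the sites that the pattern still constrains."""
    return {i for i, allele in enumerate(pattern) if allele != MASK}


def is_component(small, big):
    """True if `small` keeps a strict subset of `big`'s sites with the same alleles."""
    s, b = unmasked_sites(small), unmasked_sites(big)
    return s < b and all(small[i] == big[i] for i in s)


def prune_redundant(saved):
    """Drop complex patterns whose saved components are all at least as significant."""
    kept = {}
    for pattern, p in saved.items():
        parts = [q for other, q in saved.items() if is_component(other, pattern)]
        if parts and all(q <= p for q in parts):
            continue  # the smaller patterns already account for the signal
        kept[pattern] = p
    return kept
```

Under this reading, a complex pattern survives only if it has no saved component or is more significant than at least one of them.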
DecoGen ANOVA Modeler: Test
Legend of Figures:
Rectangle Boxes: Tables in the database.



Rounded Boxes: Children tables that depend on their parent tables.
This dependency requires that a parent record exist
before a child record can be created.

Identifying parent / child relationship. It depicts the not nullable
1-to-0-or-many relationship.

Non-identifying parent / child relationship. It represents the nullable
0-or-1-to-many relationship.

Identifying parent / child relationship. It depicts the not nullable
1-to-1-or-many relationship.

Non-identifying parent / child relationship. It represents the not
nullable 1-to-1-or-many relationship.

Identifying parent / child relationship. It depicts the not nullable
1-to-exact-1 relationship.

Non-identifying parent / child relationship. It represents the nullable
0-or-1-to-exact-1 relationship.

Non-identifying parent / child relationship. It represents the not
nullable 0-or-1-to-many relationship.
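
The distinction the legend draws between identifying and non-identifying relationships can be illustrated with a small schema. The Python sketch below uses the standard sqlite3 module and follows the usual data-modeling convention: in an identifying relationship the parent key forms part of the child's primary key, and in a nullable non-identifying relationship the child may exist without a parent. The tables gene, polymorphic_site and assay are hypothetical and are not taken from the figures.

```python
import sqlite3

SCHEMA = """
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL
);
-- Identifying relationship: the parent key is part of the child's primary key.
CREATE TABLE polymorphic_site (
    gene_id INTEGER NOT NULL REFERENCES gene (gene_id),
    site_no INTEGER NOT NULL,
    PRIMARY KEY (gene_id, site_no)
);
-- Non-identifying, nullable relationship: the child has its own key and
-- may exist without any parent.
CREATE TABLE assay (
    assay_id INTEGER PRIMARY KEY,
    gene_id  INTEGER REFERENCES gene (gene_id)
);
"""


def build_schema():
    """Create the illustrative tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
    conn.executescript(SCHEMA)
    return conn


def insert_ok(conn, sql, params=()):
    """Run one INSERT and report whether the constraints accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, a polymorphic_site row is rejected until its gene row exists, which is the "parent record must exist first" dependency described for the rounded boxes. An assay row with a NULL gene_id is accepted.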